Chapter 1


Introduction 


1.1 Scope 


This document describes the DECchip 21164-AA chip, a microprocessor that implements the 
Alpha architecture. This specification describes the external interface and programming infor- 
mation specific to the actual implementation. It does not describe the detailed implementation 
of the chip nor the Alpha architecture. The reader is referred to the Alpha System Reference 
Manual for the architectural specification. 


1.2 Chip Features 


The DECchip 21164-AA microprocessor is a CMOS-5 (0.5 micron) super-scalar super-pipelined
implementation of the Alpha architecture. It will be the basis of a family of Alpha products. 
The DECchip 21164-AA chip is designed to meet the requirements of a wide variety of systems, 
ranging from uni-processor workstations to multiprocessors. DECchip 21164-AA is intended to 
integrate well into a certain style of system environment, one with a particular kind of cache 
coherence protocol and a pipelined or lock-step style of bus and memory subsystem operation. 
A number of configuration options allow its use in system designs ranging from extremely
simple systems with minimum component count to high-performance systems with very
high cache and memory bandwidth. DECchip 21164-AA design compromises are made with
the intention of achieving maximum performance in high-performance systems while offering 
competitive performance and reasonable implementation constraints in lower cost systems. 


DECchip 21164-AA features: 


• Alpha instructions to support byte, word, longword, quadword, DEC F_floating, G_floating
and IEEE S_floating and T_floating data types. Limited support is provided for DEC
D_floating operations. Partial implementation of the architecturally optional instructions:
FETCH and FETCH_M.

• Demand paged memory management unit which, in conjunction with properly written PALcode,
fully implements the Alpha memory management architecture appropriate to the operating
system running on the processor. The translation buffer can be used with alternative PALcode
to implement a variety of page table structures and translation algorithms.

• On-chip 48-entry I-stream TB and 64-entry D-stream TB in which each entry maps one 8Kbyte
page or a group of 8, 64, or 512 8Kbyte pages, with the size of each TB entry's group specified
by hint bits stored in the entry.

• World class performance.

• Low average cycles per instruction (CPI). The DECchip 21164-AA chip can issue four Alpha
instructions in a single cycle, thereby minimizing the average CPI. A number of low-latency
and/or high-throughput features in the instruction issue unit and the on-chip components of
the memory subsystem further reduce the average CPI.

• On-chip high-throughput floating point units, capable of executing both DEC and IEEE
floating point data types.

• On-chip 8Kbyte virtual instruction cache with seven-bit ASNs (MAX_ASN=127).

• On-chip dual-read-ported 8Kbyte data cache (implemented as two 8Kbyte data caches
containing identical data).

• On-chip write buffer with six 32-byte entries.

• On-chip 96Kbyte 3-way set associative writeback second level cache.

• Bus interface unit, which contains logic to directly access an optional external third-level
cache without CPU module action. The size and access time of the external third-level cache
are programmable.

• On-chip performance counters to measure and analyze CPU and system performance.

• An instruction cache diagnostic interface to support chip and module level testing.

• An internal clock generator which generates both a high-speed clock needed by the chip itself,
and a pair of system clocks for use by the CPU module.

• The DECchip 21164-AA chip is packaged in 503 pin IPGA packages. The heat sinks are
separable and application specific.


1.3 Terminology and Conventions 


1.3.1 Numbering 


All numbers are decimal unless otherwise indicated. Where there is ambiguity, numbers other
than decimal are indicated with the name of the base following the number in parentheses, e.g.,
FF (hex).


1.3.2 UNPREDICTABLE And UNDEFINED 


Throughout this specification, the terms UNPREDICTABLE and UNDEFINED are used. Their
meanings are quite different and must be carefully distinguished. One key difference is that
only privileged software (that is, software running in kernel mode) may trigger UNDEFINED 
operations, whereas either privileged or unprivileged software may trigger UNPREDICTABLE 
results or occurrences. A second key difference is that UNPREDICTABLE results and occurrences 
do not disrupt the basic operation of the processor; the processor continues to execute instructions 
in its normal manner. In contrast, UNDEFINED operation may halt the processor or cause it to 
lose information. 


A result specified as UNPREDICTABLE may acquire an arbitrary value subject to a few con- 
straints. Such a result may be an arbitrary function of the input operands or of any state 
information that is accessible to the process in its current access mode. UNPREDICTABLE re- 
sults may be unchanged from their previous values. UNPREDICTABLE results must not be 
security holes. Specifically, UNPREDICTABLE results must not do any of the following: 

• Depend on or be a function of the contents of memory locations or registers which are
inaccessible to the current process in the current access mode.

• Write or modify the contents of memory locations or registers to which the current process in
the current access mode does not have access.

• Halt or hang the system or any of its components.


For example, a security hole would exist if some UNPREDICTABLE result depended on the value 
of a register in another process, on the contents of processor temporary registers left behind by 
some previously running process, or on a sequence of actions of different processes. 


An occurrence specified as UNPREDICTABLE may happen or not based on an arbitrary choice 
function. The choice function is subject to the same constraints as are UNPREDICTABLE results 
and, in particular, must not constitute a security hole. 


Results or occurrences specified as UNPREDICTABLE may vary from moment to moment, imple- 
mentation to implementation, and instruction to instruction within implementations. Software 
can never depend on results specified as UNPREDICTABLE. 


Operations specified as UNDEFINED may vary from moment to moment, implementation to 
implementation, and instruction to instruction within implementations. The operation may vary 
in effect from nothing to stopping system operation. UNDEFINED operations must not cause the 
processor to hang, i.e., reach an unhalted state from which there is no transition to a normal state 
in which the machine executes instructions. Only privileged software (that is, software running 
in kernel mode) may trigger UNDEFINED operations. 


1.3.3 Data Field Size 


The term INTnn, where nn is one of 2, 4, 8, 16, 32, or 64, refers to a data field of nn contiguous 
naturally aligned bytes. INT4 refers to a naturally aligned longword, for example. 
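
As an illustration of the INTnn convention, the following minimal sketch checks natural alignment, taking it to mean that the starting byte address is a multiple of the field size. The function and constant names are hypothetical and are not part of this specification.

```python
# Sizes for which the INTnn term is defined in this section.
VALID_INT_SIZES = (2, 4, 8, 16, 32, 64)


def is_naturally_aligned(address, nn):
    """Return True if an INTnn field starting at byte `address` is naturally aligned."""
    if nn not in VALID_INT_SIZES:
        raise ValueError("INT%d is not a defined data field size" % nn)
    # Every valid size is a power of two, so alignment means the low
    # log2(nn) address bits are all zero.
    return address & (nn - 1) == 0
```

For example, an INT4 (longword) at address 1004 (hex) is aligned, while an INT8 at the same address is not.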


1.3.4 Ranges And Extents 


Ranges are specified by a pair of numbers separated by a ".." and are inclusive, e.g., a range of
integers 0..4 includes the integers 0, 1, 2, 3, and 4. 


Extents are specified by a pair of numbers in angle brackets separated by a colon and are inclusive, 
e.g., bits <7:3> specify an extent of bits including bits 7, 6, 5, 4, and 3. 
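
The inclusive range and extent conventions can be expressed in code as follows. This is only a sketch; the helper names are hypothetical.

```python
def inclusive_range(first, last):
    """Expand the range notation first..last into the list of integers it includes."""
    return list(range(first, last + 1))


def extract_extent(value, high, low):
    """Return the field value<high:low>, with both end bits included."""
    if low < 0 or high < low:
        raise ValueError("extent must satisfy high >= low >= 0")
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)
```

With these helpers, `inclusive_range(0, 4)` yields the five integers 0 through 4, and `extract_extent(value, 7, 3)` returns the five-bit field made up of bits 7, 6, 5, 4, and 3.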


1.3.5 Register Format Notation 


This specification contains a number of figures that show the format of various registers, followed 
by a description of each field. In general, the fields on the register are labeled with either a name 
or a mnemonic. The description of each field includes the name or mnemonic, the bit extent, and
the type.


The “Type” column in the field description includes both the actual type of the field, and an 
optional initialized value, separated from the type by a comma. The type denotes the functional 
operation of the field, and may be one of the values shown in Table 1-1. If present, the initialized 
value indicates that the field is initialized by hardware to the specified value at powerup. If the 
initialized value is not present, the field is not initialized at powerup. 

Table 1-1: Register Field Type Notation

Notation    Description

A read-write bit or field. The value may be read and written by software.

A read-only bit or field. The value may be read by software. It is written by hardware;
software writes are ignored.

A write-only bit or field. The value may be written by software. It is used by hardware
and reads by software return an UNPREDICTABLE result.

A write bit or field. The value may be written by software. It is used by hardware and
reads by software return a 0.

A write-one-to-clear bit. If reads are allowed to the register then the value may be
read by software. If it is a write-only register then a read by software returns an
UNPREDICTABLE result. Software writes of a 1 cause the bit to be cleared by hardware.
Software writes of a 0 do not modify the state of the bit.

A write-zero-to-clear bit. If reads are allowed to the register then the value may
be read by software. If it is a write-only register then a read by software returns
an UNPREDICTABLE result. Software writes of a 0 cause the bit to be cleared by
hardware. Software writes of a 1 do not modify the state of the bit.

A write-anything-to-the-register-to-clear bit. If reads are allowed to the register then
the value may be read by software. If it is a write-only register then a read by software
returns an UNPREDICTABLE result. A software write of any value to the register causes
the bit to be cleared by hardware.

A read-to-clear field. The value is written by hardware and remains unchanged until
read. The value may be read by software, at which point hardware may write a new
value into the field.
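
The three clearing behaviors described in Table 1-1 differ only in how the written value affects the stored bits. The sketch below models them for a field of a given width. The function names are hypothetical, and the model ignores any hardware setting of the bits that may occur at the same time as the write.

```python
def write_one_to_clear(current, written, width):
    """Bits written as 1 are cleared; bits written as 0 keep their state."""
    mask = (1 << width) - 1
    return current & ~written & mask


def write_zero_to_clear(current, written, width):
    """Bits written as 0 are cleared; bits written as 1 keep their state."""
    mask = (1 << width) - 1
    return current & written & mask


def write_anything_to_clear(current, written, width):
    """Any write to the register clears the field, whatever value is written."""
    return 0
```

For a four-bit field holding 1011 (binary), writing 0010 to a write-one-to-clear field and writing 1101 to a write-zero-to-clear field both leave 1001.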


In addition to named fields in registers, other bits of the register may be labeled with one of the 
four symbols listed in Table 1-2. These symbols denote the type of the unnamed fields in the 
register. 


Table 1-2: Register Field Notation

Notation    Description

Fields specified as Read As Zero (RAZ) return a zero when read.

Fields specified as Read As One (RAO) return a one when read.


Fields specified as Ignore (IGN) are ignored when written and UNPREDICTABLE when 
read if not otherwise specified. 

Fields specified as Must Be Zero (MBZ) must never be filled by software with a non- 
zero value. If the processor encounters a non-zero value in a field specified as MBZ, a 
Reserved Operand exception occurs. 


Fields specified as Should Be Zero (SBZ) should be filled by software with a zero value. 
These fields may be used at some future time. Non-zero values in SBZ fields produce 
UNPREDICTABLE results. 

1.4 Chip Summary


Table 1-3: DECchip 21164-AA Chip Summary and Micro-architecture

Feature                          Description

Estimated Cycle Time Range       4.4ns to 3.2ns†
Product Speed Bin Points         To Be Determined
Process Technology               CMOS5 (0.5 micron CMOS) and CMOS5S (TBD micron CMOS)
Transistor count
Die Size
Package                          503 pin IPGA (interstitial pin grid array)
No. Chip Pads                    581
No. Signal Pins                  289
Typ Maximum Power Dissipation    approx. 60W @ 3.5ns cycles, Vdd=3.45V‡
Clocking input                   Two times the internal clock speed, e.g., 571.4 MHz at a
                                 3.5ns cycle time.
Virtual address size             43 bits
Physical address size            40 bits
Page size                        8Kbytes
Issue rate                       4 instructions per cycle
Integer Pipeline                 7 stage
Floating Pipeline                9 stage
On-chip Dcache                   8Kbyte, physical, direct-mapped, write-thru, 32-byte block,
                                 32-byte fill
On-chip Icache                   8Kbyte, virtual, direct-mapped, 32-byte block, 32-byte fill,
                                 128 ASNs (MAX_ASN=127)
On-chip Scache                   96Kbyte, physical, 3-way set associative, writeback,
                                 32 or 64-byte block, 32 or 64-byte fill
On-chip DTB                      64-entry, fully-associative, NLU replacement, 8K pages,
                                 128 ASNs (MAX_ASN=127), full granularity hint support
On-chip ITB                      48-entry, fully-associative, NLU replacement, 128 ASNs
                                 (MAX_ASN=127), full granularity hint support
FPU                              On-chip FPU supports both IEEE and DEC floating point
Bus                              Separate data and address bus. 128-bit
Serial ROM Interface             Allows the chip to access a serial ROM

† This range should not be interpreted as implying any particular production speed bin point.
Speed bin ranges will not be known until characterization of production CMOS5 parts has been
completed. The highest performance system designs should be designed to accept 3.2ns DECchip
21164-AA parts, though it is not known if or when production parts that fast will be available.

‡ Power consumption scales linearly with frequency over the frequency range 225MHz to 312MHz.
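
The relation between cycle time and the clocking input quoted in Table 1-3 is simple arithmetic, sketched below. The function names are hypothetical; the only facts taken from the table are that the input clock runs at twice the internal clock and that a 3.5ns cycle corresponds to a 571.4 MHz input.

```python
def internal_clock_mhz(cycle_time_ns):
    """Internal clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_time_ns


def input_clock_mhz(cycle_time_ns):
    """Clocking input frequency in MHz: two times the internal clock speed."""
    return 2.0 * internal_clock_mhz(cycle_time_ns)
```

A 3.5ns cycle gives an internal clock of about 285.7 MHz and an input clock of about 571.4 MHz; a 3.2ns cycle gives an internal clock of 312.5 MHz.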

1.5 Revision History


Table 1-4: Revision History

Who    When                Description of change

JHE    9-Feb-1992          Initial version.
JHE    1-March-1992        Add chip summary. Initial release.
JHE    29-November-1992    Updates for new revision.

Chapter 2


DECchip 21164-AA Micro-Architecture 


2.1 Introduction


This chapter gives a programmer and system designer a view of the DECchip 21164-AA micro- 
architecture. It is intended to be sufficient for almost all purposes. More detailed hardware 
descriptions of the chip exist in the internal specification and the behavioral model. 


DECchip 21164-AA can issue four instructions in a single cycle. Scheduling and issue rules are
given at the end of the chapter. DECchip 21164-AA is a pipelined CPU with 4 Ibox¹ stages, 3
integer operate stages and 4 floating point operate stages. The pipeline is presented later in this
chapter.


The combination of DECchip 21164-AA and its PALcode implements the Alpha architecture. 
Parts of the hardware design assume specific PAL functionality. This functionality is described 
in the next chapter. If a certain piece of hardware is “architecturally incomplete", the missing 
functionality must be implemented in PALcode. 


2.2 Overview 


The DECchip 21164-AA microprocessor consists of five functional units: 


• The Ibox fetches, decodes, and issues instructions. It manages the pipelines (data bypassing),
the PC, instruction caching (Icache), prefetching, and instruction stream memory management.
It also contains interrupt and trap handling hardware.

• The Ebox contains the two integer execution units which execute all integer instructions. It
also partially executes all memory instructions by calculating the effective address, if there
is one.

• The Mbox processes all load and store operations after the Ebox produces the address. It
implements data stream memory management, executes loads, stores, the memory barrier
instruction, and some other instructions. It manages outstanding load misses, the write
buffer, and the data cache (Dcache). It enforces any reference ordering required for correct
operation or by the Alpha shared memory model. It also buffers physical instruction stream
requests sent by the Ibox.

• The Cbox processes all accesses sent by the Mbox and implements all memory related external
interface functions, particularly the coherence protocol functions for writeback caching. It
controls the Scache, a 96 Kbyte, 3 way set-associative, writeback, data and instruction cache.
The Cbox also manages the optional external direct-mapped Bcache. It handles all instruction
and data primary cache read misses, performs the function of writing data from the write
buffer into the shared coherent memory subsystem, and has a major role in executing the
memory barrier instruction.

• The Fbox contains the two floating point execution units, one which basically executes
floating multiply instructions and another which executes all other floating point instructions,
particularly floating point add and subtract. Both units execute the CPYS instruction.

¹ The Ibox is the unit which fetches, decodes and issues instructions.


The Ebox and Fbox can each accept one or two instructions per cycle. If code is properly scheduled,
DECchip 21164-AA can issue up to four instructions per cycle. 


Figure 2-1 is a block diagram of DECchip 21164-AA showing the major functional elements and 
their positions in the pipeline. 

Figure 2-1: Abstract CPU Block Diagram


2.3 Ibox


The primary function of the Ibox is to issue instructions to the Ebox, Mbox and Fbox. In order to 
provide those instructions, the Ibox also contains the prefetcher, PC pipeline, 48-entry ITB, abort 
logic, register conflict or dirty logic, and interrupt and exception logic. The Ibox decodes four 
instructions in parallel and checks that the required resources are available for each instruction. 
If resources are available and multiple issue is possible, then up to four instructions may be 
issued. Section 2.10 gives the detailed rules governing multiple instruction issue. The Ibox issues
only the instructions for which all required resources are available. The Ibox does NOT issue 
instructions out of order, even if the resources are available for a later instruction and not for an 
earlier one. 


The Ibox controls the primary instruction cache, the Icache. See Section 2.8.2 for more detail. 


The Ibox does not advance to a new group of four instructions until all instructions in the current
group have been issued. The Ibox only handles naturally aligned groups of four instructions
(INT16). If a branch to the middle of such a group occurs, the Ibox attempts to issue the
instructions from the branch target to the end of the INT16, proceeding to the next INT16 of
instructions only when all the instructions in the target INT16 have been issued. This implies
that achieving maximum issue rate requires that code be scheduled properly and that NOPs
(floating or integer) be used to fill empty slots in the schedule.
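
The effect of branching into the middle of an INT16 can be illustrated with a short sketch. It relies on Alpha instructions being four bytes long, so that an INT16 holds four of them; the function name is hypothetical, and the sketch shows only which instructions are candidates for issue, not the issue rules of Section 2.10.

```python
INSTRUCTION_BYTES = 4   # each Alpha instruction is one longword
GROUP_BYTES = 16        # the Ibox works on naturally aligned INT16 groups


def issue_candidates(target_pc):
    """PCs the Ibox attempts to issue from the INT16 that contains target_pc."""
    first = target_pc & ~(INSTRUCTION_BYTES - 1)
    group_base = target_pc & ~(GROUP_BYTES - 1)
    return list(range(first, group_base + GROUP_BYTES, INSTRUCTION_BYTES))
```

A branch to 1008 (hex) leaves only the instructions at 1008 and 100C available in that group, so at most two instructions can issue before the Ibox moves to the next INT16. Aligning frequently executed branch targets to INT16 boundaries avoids this loss.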


2.3.1 Instruction Prefetch


The Ibox contains an aggressive instruction prefetcher and a four entry prefetch buffer (called 
the refill buffer). Each Icache miss is checked in the refill buffer. If the refill buffer contains the 
instruction data, it fills the Icache and instruction buffer simultaneously. If the refill buffer does 
not contain the necessary data, a fetch and a number of prefetches are sent to the Mbox. If these 
requests are all Scache hits, it is possible for instruction data to stream into the Ibox at the rate 
of one INT16, four instructions, per cycle. The Ibox can sustain up to quad-instruction issue from 
this Scache fill stream, filling the Icache simultaneously. The refill buffer holds all returned fill 
data until the data is required by the Ibox pipeline. 


Each fill occurs when the instruction buffer stage in the Ibox pipeline requires a new INT16. 
The INT16 is written into the Icache and the instruction buffer simultaneously. This can occur 
at a maximum rate of one Icache fill per cycle. The actual rate depends on how frequently the 
instruction buffer stage requires a new INT16 and on availability of data in the refill buffer. 


Once an Icache miss occurs, the Icache enters fill mode. When it is both in fill mode and awaiting 
a fill, the Icache is checked for hit. If the instruction data is found in the Icache, the Icache returns 
to access mode and the prefetcher stops sending fetches to the Mbox. When a new PC is loaded 
(e.g., taken branches) the Icache returns to access mode until the first miss. The refill buffer 
receives and holds instruction data from prefetches initiated before the Icache returned to
access mode. 

2.3.2 Branch Execution


When a branch or jump instruction is fetched from the Icache, the Ibox takes one cycle to calculate 
the target PC before it is ready to fetch the target instruction stream. In the second cycle after 
the fetch, the Icache is accessed at the target address. Branch and PC prediction are necessary 
to predict and begin fetching the target instruction stream before the branch or jump instruction 
is issued. 


The Icache records the outcome of branch instructions in a two-bit history state provided for each 
instruction location in the cache. This information is used as the prediction for the next execution 
of the branch instruction. The history status is not initialized on Icache fill, so it may "remember" 
a branch which is evicted from the Icache and subsequently reloaded. 


DECchip 21164-AA does not limit the number of branch predictions outstanding to one; it predicts
branches even while waiting to confirm the prediction of previously predicted branches. There 
can be one branch prediction pending for each of stages 3 and 4 plus up to four in stage 2. 


When a predicted branch is issued, the Ebox or Fbox checks the prediction. The branch history
table is updated accordingly. On branch mispredict, a mispredict trap occurs and the Ibox restarts 
execution from the correct PC. 


DECchip 21164-AA provides a twelve-entry subroutine return stack which is controlled by
decoding the opcode (BSR, HW_REI and JMP/JSR/RET/JSR_COROUTINE), and disp<15:14> in
JMP/JSR/RET/JSR_COROUTINE. The stack stores an Icache index in each entry. (Note that 
the stack is implemented as a circular queue which wraps around in the overflow and underflow 
cases.) 
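
The wrap-around behavior of a circular return stack can be sketched as follows. The class and method names are hypothetical, and the pointer convention (a single index to the next free entry) is an assumption; only the twelve-entry size, the stored Icache index, and the wrapping on overflow and underflow come from the text.

```python
class ReturnStack:
    """Circular queue model of a subroutine return stack holding Icache indexes."""

    def __init__(self, entries=12):
        self.slots = [0] * entries
        self.top = 0  # index of the next free entry (assumed convention)

    def push(self, icache_index):
        # On overflow the write wraps around and overwrites the oldest entry.
        self.slots[self.top] = icache_index
        self.top = (self.top + 1) % len(self.slots)

    def pop(self):
        # On underflow the pointer wraps around and returns whatever is stored there.
        self.top = (self.top - 1) % len(self.slots)
        return self.slots[self.top]
```

Because nothing is checked on overflow or underflow, a call chain deeper than twelve simply loses its oldest predictions; the resulting wrong predictions are caught later by the PC check rather than by the stack itself.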


DECchip 21164-AA uses the Icache index hint in the JMP and JSR instructions to predict the 
target PC. The Icache index hint in the instruction’s displacement field is used to access the direct 
mapped Icache. The upper bits of the PC are formed from the data in the Icache tag store at
that index. Later in the pipeline, the PC prediction is checked against the actual PC generated 
by the Ebox. A mismatch causes a PC mispredict trap and restart from the correct PC. This is 
similar to branch prediction. 


The RET, JSR_COROUTINE, and HW_REI instructions predict the next PC using the index from 
the subroutine return stack. The upper bits of the PC are formed from the data in the Icache tag 
at that index. These predictions are checked against the actual PC in exactly the same way that 
JMP and JSR predictions are checked. 


Note that changes from PAL mode to native mode and vice versa are predicted on all PC predic- 
tions that use the subroutine return stack. If the opcode isn’t HW_REI, this might not seem to 
make sense, but if the PC prediction is correct, the mode prediction will be as well. 


As noted above, Istream prefetching is disabled when a PC prediction is outstanding. 


2.3.3 ITB 


The Ibox contains a 48-entry fully associative translation buffer to cache recently used
instruction-stream address translations and protection information for mapped regions ranging
from 8 Kbytes (one page) to 4 Mbytes (512 contiguous 8 Kbyte pages) in size. The ITB uses a
not-last-used replacement algorithm. The ITB is filled and maintained by PALcode. Each entry
supports all four granularity hint bit combinations, permitting translation for up to 512
contiguously mapped 8 Kbyte pages using a single ITB entry.
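
The four granularity hint combinations correspond to groups of 1, 8, 64, and 512 pages. The sketch below assumes the Alpha architecture encoding in which a two-bit hint value N selects a group of 8 to the power N pages; the function names are hypothetical.

```python
PAGE_BYTES = 8 * 1024  # 8 Kbyte page


def pages_mapped(granularity_hint):
    """Number of contiguous 8 Kbyte pages covered by one TB entry."""
    if granularity_hint not in (0, 1, 2, 3):
        raise ValueError("granularity hint is a two-bit field")
    return 8 ** granularity_hint


def bytes_mapped(granularity_hint):
    """Size in bytes of the region covered by one TB entry."""
    return pages_mapped(granularity_hint) * PAGE_BYTES
```

With the largest hint, a single entry covers 512 pages, or 4 Mbytes, so a large contiguous region such as a frame buffer can be translated without consuming many TB entries.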


The operating system, via PALcode, must ensure that virtual addresses can only be mapped 
through a single ITB entry or super page mapping at one time. Multiple simultaneous mapping 
can cause UNDEFINED results. 


While not executing in PAL mode, the 43-bit virtual program counter (PC) is presented each cycle 
to the ITB. If the PTE associated with the PC is cached in the ITB, the protection bits for the 
page which contains the PC are used by the Ibox to do the necessary access checks. If there is 
an Icache miss and the PC is cached in the ITB, the PFN and protection bits for the page which 
contains the PC are used by the Ibox to do the address translation and access checks. 


The DECchip 21164-AA ITB supports 128 ASNs (MAX_ASN=127) via a seven-bit ASN field in 
each ITB entry. PALcode which supports writes to the architecturally-defined TBIAP register 
does so by using the hardware-specific HW_MTPR instruction to write to a specific hardware 
register. This has the effect of invalidating ITB entries which do not have their ASM bit set. 
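
The effect of the invalidate described above can be modeled in a few lines. The dictionary layout of a TB entry and the function name are hypothetical; the sketch shows only the rule that entries without their ASM bit set are invalidated while entries with ASM set are kept.

```python
def invalidate_all_process(entries):
    """Model of a TBIAP-style invalidate over a list of TB entry dictionaries."""
    for entry in entries:
        # Entries with the ASM (address space match) bit set survive.
        if not entry["asm"]:
            entry["valid"] = False
    return entries
```

An entry such as `{"valid": True, "asm": False, "asn": 5}` becomes invalid, while an otherwise identical entry with `"asm": True` is left untouched.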


DECchip 21164-AA provides two optional translation extensions referred to as super pages. They 
are enabled via ICSR<SPE>. One super page mapping maps virtual address bits <39:13> one-to- 
one to physical address bits <39:13> when virtual address bits <42:41> = 2. This maps the entire 
physical address space four times over to the quadrant of the virtual address space with virtual 
address bits <42:41> = 2. The second super page mapping maps virtual address bits <29:13> 
one-to-one to physical address bits <29:13> with physical address bits <39:30> set to 0. This 
mapping occurs for virtual addresses with bits <42:30> = 1FFE(Hex), mapping a 30-bit region of 
physical address space to a single region of the virtual address space defined by virtual address 
bits <42:30> = 1FFE(Hex). Access to either super page mapping is only allowed while executing 
in kernel mode. 
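
The two super page mappings can be summarized in a short sketch. The function name and parameters are hypothetical. A single enable flag stands in for the control register enable, and the byte-within-page bits <12:0> are assumed to pass through unchanged, as they do for any page-based translation.

```python
def superpage_translate(va, kernel_mode, enabled=True):
    """Return the physical address for a super page access, or None if none applies."""
    if not (enabled and kernel_mode):
        return None
    if (va >> 41) & 0x3 == 2:
        # VA<42:41> = 2: PA<39:13> = VA<39:13>; VA<40> does not affect the result.
        return va & ((1 << 40) - 1)
    if (va >> 30) & 0x1FFF == 0x1FFE:
        # VA<42:30> = 1FFE (hex): PA<29:13> = VA<29:13>, PA<39:30> = 0.
        return va & ((1 << 30) - 1)
    return None
```

A return value of None means the address is not covered by a super page and would be translated through the translation buffer in the normal way.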


2.3.4 Interrupt Logic 


The DECchip 21164-AA chip supports three sources of interrupts: hardware, software and asyn- 
chronous system trap (AST). There are seven level-sensitive hardware interrupts sourced by 
pins, 15 software interrupts sourced by an on-chip IPR (SIRR), and 4 AST interrupts sourced by 
a second on-chip IPR (ASTRR). Interrupts are masked by the hardware interrupt priority level 
register (IPL). In addition, AST interrupts are qualified by the current processor mode. The
performance counter interrupts, the serial line interrupt, and the internally-detected correctable
error interrupt are all maskable by bits in the ICSR IPR (see Chapter 3). All interrupts are
disabled when the processor is executing PALcode. 


Table 2-1 shows which interrupts are enabled for a given IPL. An interrupt is enabled if the 
current IPL is less than the target IPL of the interrupt. 


Table 2-1: Interrupt Priority Level Effect

Interrupt Source                                          Target IPL (decimal)

Software Interrupt Request 1                              1
Software Interrupt Request 2                              2
Software Interrupt Request 3                              3
Software Interrupt Request 4                              4
Software Interrupt Request 5                              5
Software Interrupt Request 6                              6
Software Interrupt Request 7                              7
Software Interrupt Request 8                              8
Software Interrupt Request 9                              9
Software Interrupt Request 10                             10
Software Interrupt Request 11                             11
Software Interrupt Request 12                             12
Software Interrupt Request 13                             13
Software Interrupt Request 14                             14
Software Interrupt Request 15                             15
AST pending (for current or more privileged mode)         2
Performance counter interrupt                             29
Power fail interrupt§                                     30
System machine check interrupt§; Internally detected      31
  correctable error interrupt pending
External interrupt 20§ (I/O interrupt at IPL 20;          20
  corrected system error interrupt)
External interrupt 21§ (I/O interrupt at IPL 21)          21
External interrupt 22§ (I/O interrupt at IPL 22;          22
  interprocessor interrupt; timer interrupt)
External interrupt 23§ (I/O interrupt at IPL 23)          23
Halt§                                                     Masked only by executing in
                                                          PAL mode.

§ These interrupts are from external sources. In some cases, the system environment provides
the logic-or of multiple interrupt sources at the same IPL.
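
The enabling rule stated before Table 2-1 reduces to a single comparison. The sketch below uses a hypothetical function name and covers only interrupts that have a target IPL; it does not model Halt, which the table lists as masked only by PAL mode, nor the additional mode and ICSR qualifications.

```python
def interrupt_enabled(current_ipl, target_ipl, in_pal_mode=False):
    """True if an interrupt with the given target IPL is enabled."""
    if in_pal_mode:
        # All interrupts are disabled while the processor executes PALcode.
        return False
    return current_ipl < target_ipl
```

For example, at a current IPL of 21 the external interrupt at IPL 22 is enabled, while the external interrupts at IPL 20 and IPL 21 are not.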


When the processor receives an interrupt request and that request is enabled, an interrupt is 
reported or delivered to the exception logic if the processor is not currently executing PALcode. 
Before vectoring to the interrupt service PAL dispatch address, the pipeline is completely drained 
to the point that instructions issued before entering the PALcode can not trap (implied DRAINT). 


The restart address is saved in the Exception Address IPR (EXC_ADDR) and the processor enters 
PALmode. The cause of the interrupt may be determined by examining the state of the INTID 
and ISR registers. 


Note that hardware interrupt requests are level sensitive and therefore may be removed before 
an interrupt is serviced. PALcode must verify the interrupt actually indicated in INTID is to be 
serviced at an IPL higher than the current IPL. If it is not, PALcode should ignore the spurious
interrupt. 

2.3.5 Performance Counters


TBD FUNCTIONALITY 


We have yet to define our performance monitoring features completely. 


2.4 Ebox 


The Ebox contains two 64-bit integer execution pipelines, a total of 2 adders, 2 logic boxes, 1 barrel 
shifter, 1 byte zapper, and 1 integer multiplier. Almost all useful bypass paths are implemented; 
the result of any completed integer operation is available for use by instructions other than 
integer multiply issuing into either pipeline. (The integer multiplier is unable to receive data
from certain bypass paths. This is reflected in the latency specification at the end of this chapter.) 
The integer multiplier retires 8 bits per cycle. Table 2-9 lists all instruction latencies. The Ebox 
also contains the 40-entry 64-bit integer register file containing the 32 integer registers defined 
by the Alpha architecture and 8 PAL shadow registers. The register file has four read ports and 
two write ports which provide operands to both integer execution pipelines and accept results 
from both pipes. The register file also accepts load instruction results (memory data) on the same 
two write ports. Arbitration implemented by the Ibox reserves the write ports for fills from the 
Mbox when appropriate. 


2.5 Mbox 


The Mbox contains the address translation buffer for all loads and stores, the write buffer address 
file, the miss address file, the Dcache interface, and Mbox IPRs. It executes up to two loads 
per cycle, though a load can not be issued simultaneously with a store or certain other Mbox 
instructions (see Section 2.10 for detailed issue rules). The address translation datapath receives 
a virtual address every cycle from each adder in the Ebox. A translation buffer with two read 
ports generates the corresponding physical addresses and access control information. 


2.5.1 Big Endian Support 


DECchip 21164-AA provides limited support for big endian data formats via MCSR<BIG_ 
ENDIAN>. When this bit is set, physical address bit <2> is inverted for all longword D-stream 
references. It is intended that this mode be set during initialization PALcode and not changed 
during operation. 
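
Inverting physical address bit <2> swaps the two longwords within each aligned quadword. A minimal sketch of that address adjustment follows; the function name is hypothetical, and the flag parameter stands in for the state of the control bit.

```python
def dstream_longword_address(physical_address, big_endian):
    """Physical address used for a longword D-stream reference."""
    if big_endian:
        # Invert physical address bit <2>, selecting the other longword
        # of the same aligned quadword.
        return physical_address ^ 0x4
    return physical_address
```

With the bit set, a longword reference to 1000 (hex) accesses 1004 (hex) and a reference to 1004 (hex) accesses 1000 (hex). Other reference sizes are not affected by this sketch, matching the text's restriction to longword references.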


2.5.2 DTB 


DECchip 21164-AA contains a 64-entry fully associative dual read-ported translation buffer which 
caches recently used data-stream page table entries for 8 Kbyte pages. Each entry supports all 
four granularity hint bit combinations permitting translation for up to 512 contiguously mapped 
8 Kbyte pages using a single DTB entry. The translation buffer uses a not-last-used replacement 
algorithm. 

The DECchip 21164-AA DTB supports 128 ASNs (MAX_ASN=127) via a seven-bit ASN field in
each DTB entry. PALcode which supports writes to the architecturally-defined TBIAP register 
does so by using the hardware-specific HW_MTPR instruction to write to a specific hardware 
register. This has the effect of invalidating DTB entries which do not have their corresponding 
ASM bit set. 


For load and store instructions and other Mbox instructions requiring address translation, the 
effective 43-bit virtual address is presented to the DTB. If the PTE of the supplied virtual address 
is cached in the DTB, the PFN and protection bits for the page which contains the address are 
used by the Mbox to complete the address translation and access checks. 


DECchip 21164-AA provides two optional translation extensions referred to as super pages. They 
are enabled via MCSR<SP<1:0>>. One super page mapping maps virtual address bits <39:13> 
one-to-one to physical address bits <39:13> when virtual address bits <42:41> = 2. This maps 
the entire physical address space four times over to the quadrant of the virtual address space 
with virtual address bits <42:41> = 2. The second super page mapping maps virtual address bits 
<29:13> one-to-one to physical address bits <29:13> with physical address bits <39:30> set to 
0. This mapping occurs for virtual addresses with bits <42:30> = 1FFE(Hex), mapping a 30-bit 
region of physical address space to a single region of the virtual address space defined by virtual 
address bits <42:30> = 1FFE(Hex). Access to either super page mapping is only allowed while 
executing in kernel mode. 
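

The two super page mappings can be sketched as follows. This is a minimal model under stated
assumptions: the assignment of MCSR<SP> bit 1 to the <42:41> = 2 mapping and bit 0 to the
<42:30> = 1FFE(Hex) mapping is assumed for illustration and is not stated in this section,
and the byte-within-page bits <12:0> are assumed to pass through unchanged. All names are
hypothetical.

```python
from typing import Optional

SP_VA_42_41 = 0b10  # assumed enable bit for the VA<42:41> = 2 mapping
SP_VA_42_30 = 0b01  # assumed enable bit for the VA<42:30> = 1FFE(Hex) mapping


def superpage_translate(va: int, sp: int, kernel_mode: bool) -> Optional[int]:
    """Return the physical address if va hits an enabled super page, else None."""
    if not kernel_mode:
        return None  # super page access is only allowed in kernel mode
    va &= (1 << 43) - 1  # effective 43-bit virtual address
    if sp & SP_VA_42_41 and (va >> 41) & 0x3 == 2:
        return va & ((1 << 40) - 1)  # PA<39:13> = VA<39:13>; VA<40> is ignored
    if sp & SP_VA_42_30 and (va >> 30) & 0x1FFF == 0x1FFE:
        return va & ((1 << 30) - 1)  # PA<29:13> = VA<29:13>, PA<39:30> = 0
    return None
```

Because VA<40> does not participate in the first mapping, two virtual addresses that differ
only in that bit translate to the same physical address in this model.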


The DTB is filled and maintained by PALcode. Figure 3-6 shows the DTB miss flow. In general, 
the operating system, via PALcode, must ensure that virtual addresses can only be mapped 
through a single DTB entry or super page mapping at one time. Multiple simultaneous mapping 
can cause UNDEFINED results. The only exception to this rule is that one virtual page may 
be mapped twice with identical data in two different DTB entries. This occurs in operating 
systems utilizing virtually accessible page tables like those used by VMS. If the level 1 page 
table is accessed virtually, PALcode ends up loading the translation information twice, once in 
the double-miss handler, and once again in the primary handler. The PTE mapping the level 1 
page table must remain constant during accesses to this page to meet this requirement. 


2.5.3 Replay Traps 


For implementation reasons, there are no stalls after the instruction issue point in the pipeline. 
For certain cases, an Mbox instruction can not be executed because of insufficient resources 
or some other reason. These instructions trap and the Ibox restarts their execution from the 
beginning of the pipeline. This is called a replay trap. Replay traps occur in the following cases: 


• Write buffer full: a store is executed when six write buffer entries are already
allocated. The trap occurs regardless of whether the store would have merged in the write
buffer.


• A load issued in E0 when all six miss address file entries are valid (not available) or a load
issued in E1 when five of the six miss address file entries are valid. The trap occurs regardless
of whether the load would have hit in the Dcache or merged with a miss address file entry.


• Alpha shared memory model order trap (Litmus test 1 trap): If a load is issued whose address
matches any miss in the miss address file, the load is aborted via a replay trap regardless
of whether the newly-issued load hits or misses in the Dcache. The address match is precise
except that it includes the case in which a longword access matches within a quadword access.
This ensures that the two loads execute in issue order.


DEC Chip 21164-AA (EV5 CPU) Specification, Revision 1.9, December 1992 


• Load-after-store trap: If a load is issued in the cycle immediately following a store that hits
in the Dcache, and both access the same memory location, a replay trap occurs. The address
match is exact with respect to low order bits of the address, but it is TBD whether it ignores
address bits <42:13>.


• When a load is followed within one cycle by any instruction which uses the result of that
load and the load misses in the Dcache, the consumer instruction traps and is restarted
from the beginning of the pipeline. This happens because the consumer instruction is issued
speculatively while Dcache hit is being evaluated. If the load misses in the Dcache, the
speculative issue of the consumer instruction was incorrect. The replay trap brings the consumer
instruction to the issue point before or simultaneously with the availability of fill data.


2.5.4 Load Instruction Execution and the Miss Address File 


The Mbox begins execution of each load instruction by translating the virtual address and
accessing the Dcache. Translation and Dcache tag read occur in parallel. If the addressed location
is found in the Dcache (a hit), the data from the Dcache is formatted and written to either the
integer or floating point register file. The formatting required depends on the particular load
instruction executed. If the data is not found in the Dcache (a miss), the address, target register
number, and formatting information are entered in the miss address file.


The miss address file (MAF) performs a load merging function. When a load miss occurs, each
MAF entry is checked to see if it contains a load miss addressing the same Dcache (32 byte) block.
If it does, and if certain merging rules are met, the new load miss is merged with an existing
MAF entry. This allows the Mbox to service two or more load misses with one data fill from the
Cbox. The merging rules are as follows:


• Merging only occurs if the new load miss addresses a different INT8 from all loads previously
entered or merged to that miss address file entry.


• Merging only occurs if the new load miss is the same access size as the loads previously
entered in that miss address file entry. I.e., quadword loads only merge with other quadword
loads and longword loads only merge with other longword loads.


• In the case of longword loads, address bit <2> must be the same. I.e., longword loads with even
addresses merge only with other even longword loads and longword loads with odd addresses
merge only with other odd longword loads.


• The miss address file does not merge floating point and integer load misses in the same entry.


• Merging is prevented for the MAF entry a certain number of cycles after the Scache access
corresponding to the MAF entry begins. Merging is prevented for that entry only if the Scache
access hits. The minimum number of cycles of merging is three, the cycle in which the first
load is issued and the two subsequent cycles. This corresponds to the most optimistic case
of a load miss being forwarded to the Scache without delay (accounting for the cycle saved
by the bypass which sends new load misses directly to the Scache when there is nothing else
pending).
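

The merging rules above can be summarized as a single predicate. The following is a minimal
sketch, not the hardware logic: the MafEntry record, its field names, and the merge_open flag
(which stands in for the timing window of the last rule) are all hypothetical, and accesses
are assumed to be naturally aligned.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MafEntry:
    block: int        # physical address >> 5: the 32-byte Dcache block
    size_bytes: int   # 4 (longword) or 8 (quadword)
    is_fp: bool       # floating point load rather than integer load
    bit2: int         # address bit <2> of the longword loads in this entry
    int8s: Set[int] = field(default_factory=set)  # INT8s already requested
    merge_open: bool = True  # hypothetical flag modelling the merge window


def new_entry(pa: int, size_bytes: int, is_fp: bool) -> MafEntry:
    return MafEntry(pa >> 5, size_bytes, is_fp, (pa >> 2) & 1, {(pa >> 3) & 0x3})


def can_merge(entry: MafEntry, pa: int, size_bytes: int, is_fp: bool) -> bool:
    """True if a new load miss may merge with an existing MAF entry."""
    return (entry.merge_open
            and entry.block == pa >> 5                    # same 32-byte block
            and entry.size_bytes == size_bytes            # same access size
            and entry.is_fp == is_fp                      # no integer/FP mixing
            and (size_bytes != 4 or entry.bit2 == (pa >> 2) & 1)
            and ((pa >> 3) & 0x3) not in entry.int8s)     # a different INT8
```

A quadword miss to address 1008(Hex) would merge with an entry created by a quadword miss to
1000(Hex) in this model, while a second miss to 1000(Hex) itself would not.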


Note that merging is allowed for loads to non-cacheable space (physical address bit <39> = 1). At 
the pins, these reads will tell the system environment which INT32 is addressed and which INT8s 
within the INT32 are actually accessed. (Merging stops for a load to non-cacheable space as soon 
as the Cbox accepts the reference.) This permits the system environment to access only those 
INT8s actually requested by load instructions. For memory mapped INT4 registers, the system 
environment must return the result of reading each register within the INT8 since DECchip
21164-AA only indicates which INT8s are accessed, not the exact length and offset of the access 
within each INT8. Systems implementing memory mapped registers with side effects from reads 
should place each such register in a separate INT8 in memory. 


When merging does not occur, a new MAF entry is allocated for the new load miss. When two
loads issued simultaneously both miss, merging is performed as if they had been issued
sequentially, with the load from Ebox pipe E0 first. The Mbox sends a read request to the Cbox for
each MAF entry allocated.


A bypass is provided so that if the load issues in Ebox pipe E0, and no MAF requests are pending,
that load's read request is sent to the Cbox immediately. Similarly, if a load from Ebox pipe E1
misses and there was no load instruction in E0 at all, the E1 load miss is sent to the Cbox
immediately. In either case, the bypassed read request is aborted if the load hits in the Dcache
or merges in the MAF.


There are six MAF entries for load misses and four more for Ibox instruction fetches and 
prefetches. Normally load misses are the highest priority Mbox request. 


If the MAF is full and a load issues in E0 or if five of the six MAF entries are valid and a load
issues in E1, an MAF full trap occurs causing the Ibox to restart execution with the load that
caused the MAF overflow. When the load arrives at the MAF the second time, an MAF entry
may have become available. If not, the MAF full trap occurs again.
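

The MAF full condition differs between the two pipes. A minimal sketch of the check follows;
the function name is hypothetical, and treating a completely full MAF as also trapping an E1
load is an assumption drawn from the wording above.

```python
MAF_LOAD_ENTRIES = 6  # from the text: six MAF entries for load misses


def maf_full_trap(valid_entries: int, pipe: str) -> bool:
    """True if a load issued in the given Ebox pipe takes an MAF full trap."""
    if pipe == "E0":
        return valid_entries >= MAF_LOAD_ENTRIES
    if pipe == "E1":
        # An E1 load needs a second free entry, since a load issued in E0
        # in the same cycle is treated as coming first.
        return valid_entries >= MAF_LOAD_ENTRIES - 1
    raise ValueError("pipe must be 'E0' or 'E1'")
```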


Eventually, the Cbox provides the data requested for a given MAF entry (a fill). If the fill is
integer data (and not floating point data), the Cbox requests that the Ibox allocate two consecutive
"bubble" cycles in the Ebox pipelines. The first bubble prevents any instruction from issuing. The
second bubble prevents only Mbox instructions (particularly loads and stores) from issuing. The
fill uses the first bubble cycle as it progresses down the Ebox/Mbox pipelines to format the data
and load the register file. It uses the second bubble cycle to fill the Dcache.


Referring to Figure 2-2, note that an instruction typically writes the register file in stage 6.
Because there is only one register file write port per integer pipeline, a no-instruction bubble
cycle is required to reserve a register file write port for the fill. Again referring to Figure 2-2,
note that a load or store accesses the Dcache in the second half of stage 4 and the first half of
stage 5. The fill operation writes the Dcache, making it unavailable for other accesses at that time.
Relative to the register file write, the Dcache (write) access for a fill occurs a cycle later than
the Dcache access for a load hit. This is because the fill data arrives just in time to be bypassed
to the consuming instruction. Since only loads and stores use the Dcache in the pipeline, the
second bubble reserved for a fill is a no-Mbox-instruction bubble. See Section 2.9 for more details
of the pipeline.


The second bubble is a subset of the first bubble. When two fills are in consecutive cycles (as they
are for Scache hit) then three total bubbles are allocated, two no-instruction bubbles followed by
one no-Mbox-instruction bubble. Note that the bubbles are requested before it is known whether
the Scache (and similarly, the Bcache) will hit. In other words, bubble allocation is speculative.
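

The way bubbles from consecutive integer fills overlap can be sketched as below. The function
and the cycle numbering are hypothetical; each fill is represented only by the cycle of its
first bubble.

```python
from typing import Dict, Iterable

NO_INSTRUCTION = "no-instruction"
NO_MBOX = "no-Mbox-instruction"


def integer_fill_bubbles(first_bubble_cycles: Iterable[int]) -> Dict[int, str]:
    """Map each affected issue cycle to the kind of bubble allocated in it."""
    bubbles: Dict[int, str] = {}
    for cycle in first_bubble_cycles:
        # The first bubble blocks every instruction and overrides a weaker
        # no-Mbox bubble already recorded for the same cycle.
        bubbles[cycle] = NO_INSTRUCTION
        # The second bubble is a subset of the first, so it is recorded
        # only if the following cycle is not already a full bubble.
        bubbles.setdefault(cycle + 1, NO_MBOX)
    return bubbles
```

For two fills in consecutive cycles this yields two no-instruction bubbles followed by one
no-Mbox-instruction bubble, matching the count given in the text.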


For fills from the Cbox to floating point registers, no cycle is allocated. Loads which conflict in
the pipeline with the fill are forced to miss. Stores which conflict in the pipeline force the fill to
be aborted in order to keep the Dcache available to the store operation. In all cases, the floating
point register(s) are filled as dictated by the associated MAF entry. A single store can block up
to four consecutive fills. (Note that the Fbox has separate write ports for fill data as is necessary
for this fill scheme.)


Up to two floating or integer registers may be written for each Cbox fill cycle. Fills deliver 32
bytes in two cycles, two INT8s per cycle. The MAF merging rules ensure that there is no more
than one register to write for each INT8, so there is a register file write port available for each
INT8. After appropriate formatting, data from each INT8 is written into the integer or floating
point register file provided there is a miss recorded for that INT8.


Load misses are all checked against the write buffer contents for conflicts between new loads
and previously issued stores. See Section 2.5.6 for more detail.


LDL_L and LDQ_L instructions always allocate a new MAF entry. No loads that follow a LDL_L
or LDQ_L are allowed to merge with it. After LDL_L or LDQ_L is issued, the Ibox does not issue
any more Mbox instructions until the Mbox has successfully sent the LDL_L or LDQ_L to the
Cbox. This guarantees correct ordering between a LDL_L or LDQ_L and a subsequent STL_C or
STQ_C even if they access different addresses.


2.5.5 Store Execution 


Stores execute in the Mbox by reading the Dcache tag store in the pipeline stage in which a load 
would read the Dcache, checking for a hit in the next stage, and writing the Dcache data store if
there is a hit in the second following pipeline stage. See Section 2.9 for pipeline details. 


Loads are not allowed to issue in the second cycle after a store (1 bubble cycle). Other instructions 
can be issued in that cycle. Stores can issue at the rate of one per cycle because stores streaming 
down the pipeline do not conflict in their use of resources (the Dcache tag store and Dcache data
store are the principal resources). However, a load uses the Dcache data store in the same early 
stage that it uses the Dcache tag store. Therefore a load would conflict with a store if it were 
issued in the second cycle after any store. Section 2.9 gives details on store execution in the 
pipeline. 


A load which is issued one cycle after a store in the pipeline creates a conflict if both access
exactly¹ the same memory location; the store hasn't updated the location when the load reads it.
This conflict is handled by forcing the load to trap (a replay trap). The Ibox flushes the pipeline
and restarts execution from the load instruction. By the time the load arrives at the Dcache the
second time, the conflicting store has written the Dcache and the load is executed normally.


It is recommended that software not load data immediately after storing it. The replay trap that 
is incurred is fairly expensive. The best solution is to schedule the load to issue three cycles after
the store. No issue stalls or replay traps will occur in that case. If the load is scheduled to issue 
two cycles after the store, it will be issue-stalled for one cycle for the reasons given above. This 
is not optimal but is much better than incurring a replay trap on the load. 
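

The scheduling guidance in this section can be condensed into a small lookup. This is a
simplified sketch: the function name and the returned strings are hypothetical, and it ignores
the additional condition stated elsewhere that the load-after-store trap applies to a store
that hits in the Dcache.

```python
def load_after_store_outcome(cycles_after_store: int, same_location: bool) -> str:
    """Penalty for a load scheduled N issue cycles after a store (N >= 1)."""
    if cycles_after_store < 1:
        raise ValueError("the load must issue after the store")
    if cycles_after_store == 1:
        # The store has not yet written the Dcache when the load reads it.
        return "replay trap" if same_location else "none"
    if cycles_after_store == 2:
        # Loads may not issue in the second cycle after any store.
        return "one-cycle issue stall"
    return "none"
```

In this model a load to the same location is free of penalty only when scheduled three or
more cycles after the store.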


For three cycles during store execution, fills from the Cbox are not placed in the Dcache. Register 
fills are unaffected. There are conflicts which make it impossible to fill the Dcache in each of 
these cycles. Fills are prevented in cycles in which a store is in pipeline stage 4, 5, or 6. Note 
that this applies most strongly to fills of floating point data. Fills of integer data allocate bubble 
cycles such that an integer fill never conflicts with a store in pipeline stages 4 or 5. A store which 
would have conflicted in stage 4 or 5 is issue-stalled instead. 


¹ It is TBD whether this address check will include the most significant bits of the address. It will be precise over bits <12:0>.


If a store is stalled at the issue point for any reason, it interferes with fills just as if it had been
issued. Again, this applies only to fills of floating point data. If, when a store issues, subsequent 
instructions at the issue point do not issue, then a "shadow" of the store remains in the pipeline 
latches at the issue point. The Mbox has special logic which detects that the stalled "shadow" 
of the store is not a new store and will never issue, so the store "shadow" is prevented from 
interfering with concurrent fills. 


For each store, a search of the MAF is done to detect load-before-store hazards. If a store is 
executed and a load of the same address is present in the MAF, two things happen: 


1. Bits are set in each conflicting MAF entry to prevent its fill from being placed in the Dcache
when it arrives and to prevent subsequent loads from merging with that MAF entry.


2. Conflict bits are set with the store in the write buffer to prevent the store from being issued
until all conflicting loads have been issued to the Cbox.


This ensures proper results from the loads and prevents incorrect data from being cached in the
Dcache.


A check is done for each new store against stores in the write buffer that have already been sent
to the Cbox but have not been completed. This is described in the next section.


2.5.6 Write Buffer and the WMB Instruction 


The write buffer address file is contained in the Mbox. The write buffer data store is contained 
in the Cbox. It contains six fully associative 32-byte entries. The purpose of the write buffer is 
to minimize the number of CPU stall cycles by providing a high bandwidth (but finite) resource 
for receiving store data. This is required since DECchip 21164-AA can generate store data at 
the peak rate of one INT8 every CPU cycle which is greater than the average rate at which the 
Scache can accept the data if Scache misses occur. 


In addition to store instructions (including HW_ST), STQ_C, STL_C, FETCH and FETCH_M 
instructions are also written into the write buffer and sent off-chip. Unlike stores, however, 
these write buffer-directed instructions are never merged into a write buffer entry with other 
instructions. 


A write buffer entry is invalid if it does not contain one of the commands listed above. 


The WMB instruction has a special effect on the write buffer. When it is executed, a bit is set in 
every write buffer entry containing valid store data that will prevent future stores from merging 
with any of the entries. Also, the next entry to be allocated is marked with a WMB flag. (Note 
that the entry marked with the WMB flag does not yet have any valid data in it). When an entry 
marked with a WMB flag is ready to issue to the Cbox, it is not issued until every previously
issued write is completely finished. This ensures correct ordering between stores issued before 
the WMB instruction and stores issued after it. 


Each write buffer entry contains a CAM for holding physical address bits <39:5>, 32 bytes of 
data, eight INT4 mask bits which indicate which of the eight INT4s in the entry contain valid 
data, and miscellaneous control bits. Among the control bits are a WMB flag, already described, 
and a no-merge bit which indicates the entry is closed to further merging. 


Two entry pointer queues are associated with the write buffer, a free entry queue and a pending
request queue. The free entry queue contains pointers to available invalid write buffer entries.
The pending request queue contains pointers to valid write buffer entries that have not yet been
issued to the Cbox. The pending request queue is ordered in allocation order.


Each time the write buffer is presented with a store instruction the physical address generated 
by the instruction is compared to the address in each valid write buffer entry that is open for 
merging. If the address is in the same INT32 as an address in a valid write buffer entry which 
also contains a store and the entry is open for merging, then the new store data is merged into 
that entry and the entry's INT4 mask bits are updated. If no matching address is found or all 
entries are closed to merging, then the store data is written into the entry at the top of the free 
entry queue, that entry is validated, and a pointer to the entry is moved from the free entry queue
to the pending request queue. Note this scheme does not maintain write ordering. 
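

The merge-or-allocate decision can be sketched as follows. This is an illustrative model only:
the WbEntry record and function names are hypothetical, stores are assumed to be aligned
longwords or quadwords, a Python list stands in for the entry queues, and the rule that only
entries containing stores are merge candidates is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

WRITE_BUFFER_ENTRIES = 6  # from the text: six 32-byte entries


@dataclass
class WbEntry:
    block: int              # physical address bits <39:5> (one INT32)
    int4_mask: int = 0      # bit i set: INT4 number i holds valid data
    no_merge: bool = False  # entry is closed to further merging


def int4_mask(pa: int, size_bytes: int) -> int:
    """Mask of the INT4s written by an aligned 4- or 8-byte store."""
    first = (pa >> 2) & 0x7
    return ((1 << (size_bytes // 4)) - 1) << first


def store(entries: List[WbEntry], pa: int, size_bytes: int) -> Optional[WbEntry]:
    """Merge or allocate; None marks the case where a replay trap is taken."""
    if len(entries) >= WRITE_BUFFER_ENTRIES:
        return None  # the trap occurs even if the store could have merged
    for entry in entries:
        if entry.block == pa >> 5 and not entry.no_merge:
            entry.int4_mask |= int4_mask(pa, size_bytes)
            return entry
    entry = WbEntry(block=pa >> 5, int4_mask=int4_mask(pa, size_bytes))
    entries.append(entry)  # list order models allocation order
    return entry
```

A quadword store to 1000(Hex) followed by a longword store to 1010(Hex) ends up in one
entry in this model, because both addresses lie in the same INT32.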


When two or more entries are in the pending request queue, the Mbox requests that the Cbox 
process the write buffer entry at the head of the pending request queue. It then removes the 
entry from the pending request queue (without placing it in the free entry queue). When the 
Cbox has completely processed the write buffer entry, it notifies the Mbox and the now invalid 
write buffer entry is placed in the free entry queue. The Mbox may request a second write buffer 
entry be processed while waiting for the Cbox to finish the first. The write buffer entries are 
invalidated and placed in the free entry queue in the order that the requests complete. That
order may be different than the order in which the requests were made.


The Mbox also requests a write buffer entry be processed every 64 cycles if there is even one 
valid entry. This ensures writes do not wait forever to be written to memory. Note that the timer 
which spurs this is free running. 


When a LDL_L or LDQ_L is processed by the Mbox, the Mbox requests processing of the next 
pending write buffer request. This increases the chances of the write buffer being empty when a 
STL_C or STQ_C is issued. 


The Mbox continues to request that write buffer entries be processed as long as one contains a
STQ_C, STL_C, FETCH, or FETCH_M instruction or as long as one is marked by a WMB flag or
there is an MB being executed by the Mbox. This ensures that these instructions are finished as
quickly as possible.


Every store that does not merge in the write buffer is checked against every valid entry. If any 
is an address match, then the WMB flag is set on the newly allocated write buffer entry. This 
prevents the Mbox from sending two writes to exactly the same block to the Cbox. The Cbox 
does not necessarily complete writes in the order in which they were issued, and reordering two 
writes to the same block can lead to an incorrect final result. 


Load misses are checked in the write buffer for conflicts. The granularity of this check is an 
INT32; any load matching any write buffer entry's address is considered a hit even if it does not 
access an INT4 marked for update in that write buffer entry. If a load hits in the write buffer, a 
conflict bit is set in the load's MAF entry which prevents the load from being issued to the Cbox
before the conflicting write buffer entry has been issued (and completed). At the same time, the 
no-merge bit is set in every write buffer entry with which the load hit. A write buffer flush flag is 
also set. The Mbox continues to request that write buffer entries be processed until all the entries 
which were ahead of the conflicting write(s) at the time of the load hit have been processed. 
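

The INT32 granularity of this check means that the INT4 mask bits play no part in it. A minimal
sketch, with a hypothetical function name and with write buffer entries represented only by a
physical address inside each entry's block:

```python
from typing import List


def write_buffer_conflicts(load_pa: int, entry_addresses: List[int]) -> List[int]:
    """Indices of write buffer entries that a load miss is considered to hit."""
    load_block = load_pa >> 5  # compare at INT32 (32-byte) granularity
    return [i for i, pa in enumerate(entry_addresses) if pa >> 5 == load_block]
```

A load from 101C(Hex) conflicts with an entry holding a store to 1000(Hex) in this model, even
though the two accesses touch different INT4s.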


Some writes can not be processed in the Scache without external environment involvement. To 
support this, the Mbox retransmits a write at the Cbox’s request. This situation arises when the 
Scache block is not dirty when the write is issued or when the access misses in the Scache. 


2.5.7 MB Instruction


The Mbox processes the MB instruction by first completing all outstanding loads and flushing the 
write buffer. It delays issuing the MB until all loads in the MAF and all writes in the write buffer 
have completed. The Mbox then issues the MB to the Cbox and waits until the Cbox signals that 
the MB has been processed before signaling the Ibox that the MB is complete.
BC_CTL<EI_OPT_CMD> determines whether the Cbox processes the MB by issuing it on the pins and
waiting for acknowledgement. If BC_CTL<EI_OPT_CMD> is not set, the Cbox retires the MB and
immediately signals the Mbox that it has been processed. After issuing the MB, the Ibox stops
issuing Mbox instructions until it receives the signal to start again.


2.5.8 Ibox Read Requests 


The Mbox has a four entry file of Ibox read requests. There is a strict one-for-one mapping 
between these request file entries and the four entries in the refill buffer in the Ibox. Allocation 
of these entries is controlled by the Ibox. The Ibox never reuses an entry until the previous read 
has completed. For Istream reads in non-cacheable space, the Mbox marks all INT8s as accessed
in the request to the Cbox.


2.5.9 Mbox Arbitration 


The Mbox arbitrates among the pending Ibox requests, load misses, and write buffer requests to
decide which is the next request to be sent to the Scache and Cbox. The Cbox overrides Mbox
arbitration to handle fills and system bus requests (invalidates and probes) and to force a write
buffer request to reissue when required by shared block write processing in the Cbox. Normally,
load misses are the highest priority Mbox request, followed by Ibox requests and write buffer
requests. Write buffer requests become higher priority than reads when a write buffer flush
condition exists.


In some cases a request is refused by the Cbox due to lack of resources or a conflict. The Mbox 
places these refused requests in a replay queue. When arbitrating for an entry in the replay 
queue, the Mbox uses a priority higher than any other Mbox source. However, when only one 
replay queue entry is allocated, the Mbox delays arbitrating for the replay queue entry such that 
other Mbox requests can slip in between replays of refused commands. Sometimes the Cbox will 
be able to process the other request despite the conflict associated with the replayed request. 
Once the Mbox has two or more commands in the replay queue, it stops sending new references 
(because those too might be refused). 
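

The priority order described in this section can be sketched as a simple selector. This is a
simplification under stated assumptions: the function and its return strings are hypothetical,
Cbox overrides are not modelled, and the deliberate delay applied when only one replay queue
entry is allocated is ignored.

```python
from typing import Optional


def next_mbox_request(replay: bool, load_miss: bool, ibox: bool,
                      write_buffer: bool, wb_flush: bool) -> Optional[str]:
    """Pick the next request source given which sources have work pending."""
    if replay:
        return "replay queue"      # higher than any other Mbox source
    if wb_flush and write_buffer:
        return "write buffer"      # flush condition raises writes above reads
    if load_miss:
        return "load miss"         # normally the highest priority
    if ibox:
        return "Ibox request"
    if write_buffer:
        return "write buffer"
    return None
```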


2.6 The Cbox 


The Cbox controls the Scache and the interface to the DECchip 21164-AA pin bus. It responds to
all Mbox generated requests: load misses, instruction fetches and prefetches, and write buffer
requests. It also implements a generic writeback cache protocol for the Scache and Bcache (external
cache). Chapter 4 describes the DECchip 21164-AA pin bus and coherence protocol.


Internal data transfers between the Mbox (and Ibox) and the Cbox are made via 16-byte buses. 
Since the internal cache fill block size is 32 bytes, cache fill operations result in two data transfers 
from the Cbox to the appropriate cache. Since each write buffer entry is 32 bytes in size, write
transactions may result in two data transfers from the write buffer to the Scache and/or the
external caches.


The Scache is fully pipelined and is able to provide fill data at a sustained rate of two INT8s 
per CPU cycle indefinitely. It is writeback and write allocate. Writes which hit in a private-dirty 
block are processed in a pipelined fashion at a rate of 1 INT16 per CPU cycle. Thus, extremely 
high data bandwidths are supported by the Scache. 


The Scache and Bcache block sizes are selected to be 32 or 64 bytes by SC_CTL<SC_BLK_SIZE>. The
Scache and Bcache block sizes are always the same.


The optional Bcache supports high data bandwidth as well. It can provide fill data at a rate as 
high as one INT16 every 4 CPU cycles if pipelined, though the Bcache in many systems operates 
at a significantly slower rate. Bandwidth of Scache writebacks into the Bcache is the same, one 
INT16 per 4 CPU cycles. Writeback bandwidth into the Bcache is optimized by maintaining a
modified bit for each INT16 in each Scache block. Only those INT16s that have actually been 
modified since the block was allocated in the Scache are written back to the Bcache. Scache 
victim writebacks can therefore take one to four Bcache cycles or not occur at all, depending on 
the state of the modified bits. 
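

The cost of a victim writeback therefore depends only on how many modified bits are set. A
minimal sketch, assuming one Bcache cycle per modified INT16; the function name and the bit
ordering of the mask are hypothetical.

```python
def victim_writeback_cycles(modified_mask: int, block_size: int = 64) -> int:
    """Bcache cycles needed to write back a victim, one per modified INT16."""
    if block_size not in (32, 64):
        raise ValueError("block size is 32 or 64 bytes")
    int16s = block_size // 16                   # two or four INT16s per block
    mask = modified_mask & ((1 << int16s) - 1)  # ignore bits beyond the block
    return bin(mask).count("1")                 # zero means no writeback occurs
```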


Programs which organize (block) their data such that it fits in the Scache for phases of execution
will benefit most significantly from the high data bandwidths available from the DECchip
21164-AA Cbox. Data blocked to fit in the Bcache will benefit from the high Bcache bandwidth
supported, but only to the degree that the particular system's Bcache has high bandwidth and never
as much as for data blocked to fit in the Scache.


The Scache is set associative but is kept a subset of the larger externally implemented Bcache
which is always direct mapped. Logic associated with the Scache tag comparators detects the
case in which an Scache miss will cause a block in the Scache to be evicted from both the Bcache
and Scache. If the Scache victim is dirty, it is copied from the Scache to the Bcache before the
new read is allowed to access the Bcache and cause the Bcache block to be copied back to main
memory. In other cases, Scache victims are buffered and written back to the Bcache after reading
the new block from the Bcache.


The Cbox detects Scache references to INT64 blocks that have already missed in the Scache.
They effectively stall the Scache until the fill occurs. When they proceed they should hit in the
Scache. A special case occurs when the Scache block size is 64 bytes and the second Scache miss
is an access to the other INT32 within an outstanding INT64 reference. Such a miss is merged in the
Cbox such that the Scache pipeline does not stall. The INT64 fill will service both of the original
INT32 references when it arrives. Only one such merge can occur for a particular Scache miss;
once both halves of an INT64 Scache block have been requested, no additional merging is done.


NOTE 


The Cbox never merges two INT32 references in non-cacheable space (physical address
bit <39> = 1). This is required so that the Cbox can inform the environment precisely
which INT8s are accessed for each non-cacheable space read reference.


Up to two Scache misses can be processed by the Cbox. These can be Bcache hits or misses.
Once one of any two Scache misses is resolved, a new Scache miss can be accepted. Once two
Scache misses are outstanding, the Cbox and Scache stop accepting new transactions until one
of the outstanding misses is completed. Merging of the kind described in the previous paragraph
affects this by effectively condensing two misses into one. Merging can not occur if two misses
are outstanding already, so with merging of INT32s into INT64s, up to three misses can be
outstanding.


The Cbox implements a writeback coherence protocol characterized by write allocate, write
invalidate, and snooping for dirty data in all coherent caches in the system for each bus read
issued by each processor. The Cbox facilitates this protocol by:


• Interacting with the external bus interface so that it may maintain an accurate Bcache
duplicate tag store (or Scache duplicate tag store in the absence of a Bcache). An accurate
duplicate tag store always has the correct dirty status for each cache block.


• Maintaining shared and dirty status bits for each Scache block. Writes to private-dirty blocks
occur without external activity. Writes to shared blocks are broadcast externally. Writes to
blocks not shared and not dirty require interface acknowledgment to transition into the dirty
state.


• Fulfilling reads to dirty blocks in the Scache or Bcache by providing the data directly from
the appropriate cache. Reads from the system bus are processed at highest priority. If the
block is dirty, the data is transmitted (under external control) from the appropriate cache.


Normally, the Mbox’s arbiter determines the next request that enters the Scache pipeline. The 
Cbox causes override of the Mbox arbiter in the following cases:


• Scache fills from the Bcache or system environment.


• Processing of system probes and invalidates.


• Write broadcast data transmission or write to a private block after receiving acknowledgment
from the interface.


2.7 Fbox 


DECchip 21164-AA has an on-chip pipelined Fbox capable of executing both DEC and IEEE
floating point instructions. IEEE floating point datatypes S and T are supported with all rounding 
modes. DEC floating point datatypes F and G are fully supported. There is limited support for D 
floating point format. The Fbox contains a 32-entry 64-bit floating point register file and a user 
accessible control register, FPCR, containing round mode controls and exception flag information. 
The Fbox contains two execution pipelines, a floating point multiply pipeline and a floating point 
add pipeline (which executes all Fbox instructions except multiply operations). The floating point 
divide unit is associated with the floating point add pipeline but is not itself pipelined. The Fbox 
can accept a multiply instruction and a non-multiply instruction every cycle, with the exception 
of floating point divide instructions. The latency for all instructions except divide is four cycles. 
Bypassers are provided to allow issue of instructions which are dependent on prior results while 
those results are written to the register file. For detailed information on instruction timing, refer 
to Section 2.10. 


The floating point multiply pipeline and floating point add pipeline are both capable of executing 
the CPYS instruction. This is important for two reasons. It allows floating point NOPs to be 
executed in either floating point pipe and it allows floating point data to be moved from register 
to register simultaneously with execution of any floating point operation. (Recall that floating 
point NOP is CPYS F31,F31,F31.)


The floating point register file has five read ports and four write ports. Four of the read ports
are used by the two pipelines to source operands. The remaining read port is used by floating
point stores. Two of the write ports are used to write results from the two pipelines. The other
two write ports are used to write fills from floating point loads. The Mbox arbitrates between
floating point loads that hit in the Dcache and floating point fills from the Cbox, making certain
that only one register need be written per fill port in each cycle. Floating point loads that conflict
with Cbox fills for use of these write ports are forced to miss in the Dcache so that the Cbox fill
can occur. The purpose of this is to maximize the available bandwidth for floating point loads.


2.8 Cache Organization 


DECchip 21164-AA includes three on-chip caches. All memory cells are fully static CMOS 6T 
structures. Parity protection is implemented in all on-chip caches. 


2.8.1 Data Cache 


The DECchip 21164-AA data cache, the Dcache, is a dual-ported cache implemented as two 8
Kbyte cache banks. It is a write-through, read-allocate direct mapped physical cache with
32-byte blocks. One bank is associated with each of the two Ebox execution pipelines, E0 and E1.
The cache banks contain exactly the same data. The Cbox keeps the Dcache coherent and keeps
it a subset of the Scache.


A load that misses in the Dcache will result in a Dcache fill. The two banks are filled at the same
time with identical data.
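

A minimal sketch of how a physical address would select a block in one 8 Kbyte bank follows.
The bit positions are an assumption based on a conventional direct-mapped layout (32-byte
blocks, 256 blocks per bank); this section of the specification does not state them, and
split_address is a hypothetical helper.

```python
from typing import Tuple

DCACHE_BANK_BYTES = 8 * 1024   # one of the two identical banks
BLOCK_BYTES = 32

NUM_BLOCKS = DCACHE_BANK_BYTES // BLOCK_BYTES    # 256 blocks per bank
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1       # 5 bits select a byte in the block
INDEX_BITS = NUM_BLOCKS.bit_length() - 1         # 8 bits select the block


def split_address(pa: int) -> Tuple[int, int, int]:
    """Split a physical address into (tag, index, offset) for a direct-mapped bank."""
    offset = pa & (BLOCK_BYTES - 1)
    index = (pa >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = pa >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_BLOCKS, OFFSET_BITS, INDEX_BITS)   # 256 5 8
    print(split_address(0x2345))                 # (1, 26, 5)
```

Because the cache is direct mapped, two addresses with the same index but different tags
compete for the same block in the bank.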


2.8.2 Instruction Cache 


The DECchip 21164-AA instruction cache, the Icache, is an 8 Kbyte virtual direct-mapped cache. 
Icache blocks contain 32 bytes of instruction stream data, associated predecode data, the
corresponding tag, a seven-bit ASN field (MAX_ASN=127), a one-bit ASM field and a one-bit PALcode
indication per block. Coherency with memory is not maintained by Ibox hardware. The virtual
instruction Icache is kept coherent with memory via the IMB PAL call, as specified in the Alpha 
SRM. 


The DECchip 21164-AA virtual instruction cache is kept coherent with changes to PTEs via the 
IMB PAL call or by assigning a new ASN to the affected process. The TBIA, TBIAP, and TBIS 
PAL calls do not affect the contents of the Icache in any way. 


2.8.3 Second Level Cache 


The DECchip 21164-AA second level cache, Scache, is a 96 Kbyte, 3-way set associative, physical, 
writeback, write-allocate cache with 32 or 64 byte blocks (configured by SC_CTL<SC_BLK_SIZE>). 
It is a mixed data and instruction cache. The Scache is fully pipelined; it processes reads and 
writes at the rate of 1 INT16 per CPU cycle and can alternate between read and write accesses 
without “bubble” cycles. 


If the Scache block size is configured to 32 bytes, the Scache is organized as three sets of 512 
blocks where each block consists of two 32-byte subblocks. Otherwise the Scache is three sets of 
512 64-byte blocks.


Scache tags contain the following special bits for each 32-byte subblock: one dirty bit, one shared
bit, two INT16 modified bits, and one valid bit. Dirty and shared are the coherence state of
the subblock required for the cache coherence protocol. The modified bits are used to prevent
unnecessary writebacks from the Scache to the Bcache. The valid bit indicates the subblock is
valid. In 64-byte block mode, the block is made up of two 32-byte subblocks and the valid, shared,
and dirty bits in one subblock always match the corresponding bits in the other subblock.
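

The Scache geometry described above can be checked with simple arithmetic. This is only a
sketch; the constant names are hypothetical, and it assumes that an INT16 is a 16-byte unit.

```python
WAYS = 3                 # "three sets" in the text, i.e. 3-way set associative
BLOCKS_PER_WAY = 512
BLOCK_BYTES = 64         # in 32-byte mode, each block is two 32-byte subblocks
SUBBLOCK_BYTES = 32
INT16_BYTES = 16         # assumption: one INT16 is 16 bytes

scache_bytes = WAYS * BLOCKS_PER_WAY * BLOCK_BYTES
int16_per_subblock = SUBBLOCK_BYTES // INT16_BYTES

if __name__ == "__main__":
    print(scache_bytes // 1024, "Kbytes")               # 96 Kbytes
    print(int16_per_subblock, "INT16 per subblock")     # 2, one per modified bit
```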


The Scache tag compare logic contains extra logic to check for blocks in the Scache which map
to the same Bcache block as a new reference. This allows the Scache block to be moved to the
Bcache (if dirty) before the block is evicted because of the new reference missing in the Bcache.


The Scache supports write broadcast by merging write data with Scache data in preparation for 
a write broadcast as required by the coherence protocol. 


2.8.4 External Cache - Bcache 


The Cbox implements control for an optional external, direct mapped, physical, writeback, write
allocate cache with 32 or 64 byte blocks. (The block size is configured by SC_CTL<SC_BLK_SIZE>.)
It is a mixed data and instruction cache. Bcache sizes of 1, 2, 4, 8, 16, 32, and 64 Mbytes are
supported. See Chapter 4.


2.9 Pipeline Organization 


DECchip 21164-AA has a seven stage pipeline for integer operate and memory reference
instructions. Floating point operate instructions progress through a nine stage pipeline. The Ibox
maintains state for all pipeline stages to track outstanding register writes. The pipeline diagrams
below show the DECchip 21164-AA pipeline for several significant examples. The first four cycles
are executed in the Ibox and the later stages are executed in the Ebox, Fbox, Mbox, and Cbox.
There are bypass paths that allow the result of one instruction to be used as a source operand of 
a following instruction before it is written to the register file.


Figure 2-2: Pipeline Examples

[Pipeline diagrams for a load (Dcache miss) and a store (Dcache hit). Stage labels: Icache
access, buffer and decode, slot, issue check, address calculation, Dcache access, hit/miss
detect, Scache tag, Scache hit, Scache data, fill, format, Dcache write.]


Table 2-2: Pipeline Examples - All Cases


Pipe Stage   Events

0            Access Icache tag and data.

1            Buffer 4 instructions, check for branches, calculate branch displacements, check for
             Icache hit.

2            Slot - swap instructions around so they are headed for pipelines capable of executing
             them. Stall preceding stages if all instructions in this stage cannot issue
             simultaneously because of function unit conflicts.

3            Check the operands of each instruction to see that the source is valid and available
             and that no write-write hazards exist. Read the integer register file. Stall preceding
             stages if any instruction cannot be issued. All source operands must be available at
             the end of this stage for the instruction to issue.


Table 2-3: Pipeline Examples - Integer Add


Pipe Stage   Events

4            Do the add.

5            Result available for use by an operate this cycle.

6            Write the integer register file. Result available for use by an operate this cycle.


Table 2-4: Pipeline Examples - Floating Add


Pipe Stage   Events

4            Read the floating register file.

5            First cycle of Fbox add pipeline.

6            Second cycle of Fbox add pipeline.

7            Third stage of Fbox add pipeline.

8            Fourth stage of Fbox add pipeline. Write the floating point register file.
             Result available for use by an operate this cycle.


Table 2-5: Pipeline Examples - Load (Dcache hit)


Pipe Stage   Events

4            Calculate the effective address. Begin the Dcache data and tag store access.

5            Finish the Dcache data and tag store access. Detect Dcache hit. Format the data as
             required. Scache arbitration defaults to E0 in anticipation of a possible miss.

6            Write the integer or floating register file - data available for use by an operate this
             cycle.


Table 2-6: Pipeline Examples - Load (Dcache miss)


Pipe Stage   Events

4            Calculate the effective address. Begin the Dcache data and tag store access.

5            Finish the Dcache data and tag store access. Detect Dcache miss. Scache arbitration
             defaults to E0 in anticipation of a possible miss. A load in E1 would be delayed at least
             one more cycle since default arbitration speculatively selects E0.

6            Begin Scache tag read.

7            Finish Scache tag read. Begin detecting Scache hit.

8            Finish detecting Scache hit. Begin accessing the correct Scache data bank. (Bcache
             index at pins; Bcache access begins.)

9            Finish Scache data bank access. Begin sending fill data from Scache.

10           Finish sending fill data from Scache. Begin Dcache fill. Format the data as required.

11           Finish Dcache fill. Write the integer or floating register file - data available for use by
             an operate this cycle.


Table 2-7: Pipeline Examples - Store (Dcache hit)


Pipe Stage   Events

4            Calculate the effective address. Begin the Dcache tag store access.

5            Finish the Dcache tag store access. Detect Dcache hit. Send store to the write buffer
             simultaneously.

6            Write the Dcache data store if hit (write begins this cycle).


The DECchip 21164-AA pipeline divides instruction processing into four static and a number of
dynamic stages of execution. The first four stages consist of the instruction fetch, buffer and
decode, slotting, and issue check logic. These stages are static in that instructions may remain
valid in the same pipeline stage for multiple cycles while waiting for a resource or stalling for
other reasons. Dynamic stages always advance state and are unaffected by any stall in the
pipeline. A pipeline stall may occur while zero instructions issue, or while some instructions of a
set of four issue and the others are held at the issue stage. A pipeline stall implies that a valid
instruction or instructions is (are) presented to be issued but cannot proceed.


Upon satisfying all issue requirements, instructions are issued into their slotted pipeline. After 
issuing, instructions cannot stall in a subsequent pipe stage. It is up to the issue stage to ensure 
that all resource conflicts are resolved before an instruction is allowed to continue. The only 
means of stopping instructions after the issue stage is an abort condition. Note that the term 
abort as used here is different from its use in the Alpha SRM. 


Aborts may result from a number of causes. In general, they may be grouped into two classes,
namely exceptions (including interrupts) and non-exceptions. The basic difference between the
two is that exceptions require that the pipeline be drained of all outstanding instructions before
restarting the pipeline at a redirected address. In either case, the pipeline must be flushed of all
instructions which were fetched subsequent to the instruction which caused the abort condition.
This includes aborting some instructions of a multiply-issued set in the case of an abort condition
on one instruction in the set. The non-exception case, however, does not need to drain
the pipeline of all outstanding instructions ahead of the aborting instruction. The pipeline can
be immediately restarted at a redirected address. Examples of non-exception abort conditions
are branch mispredictions, subroutine call/return mispredictions, and replay traps. Data cache
misses can cause aborts or issue stalls depending on the cycle-by-cycle timing.


In the event of an exception other than an arithmetic exception, the processor aborts all instruc- 
tions issued after the exceptional instruction as described above. Due to the nature of some 
exception conditions, this may occur as late as the integer register file write cycle. (In the case of 
an arithmetic exception, the processor may execute instructions issued after the exceptional in- 
struction.) Next, the address of the exceptional instruction is latched in the EXC_ADDR IPR. (In 
the case of an arithmetic exception, the address latched in the EXC_ADDR IPR is that of the last
instruction executed, which may be a later instruction than the exceptional instruction.) When
the pipeline is fully drained, the processor begins instruction execution at the address given by 
the PALcode dispatch. The pipeline is drained when all outstanding writes to both the integer 
and floating point register file have completed and all outstanding instructions have passed the 
point in the pipeline such that all instructions are guaranteed to complete without an exception 
in the absence of a machine check. 


Replay traps are aborts that occur when an instruction requires a resource that is not available 
at some point in the pipeline. Generally these are Mbox resources whose availability could not 
be anticipated accurately at issue time. If the necessary resource is not available when the 
instruction requires it, the instruction is aborted and the Ibox begins fetching at exactly that 
instruction, thereby replaying the instruction in the pipeline. A slight variation on this is the 
load-miss-and-use replay trap in which an operate is issued just as Dcache hit is being evaluated
to determine if one of the instruction's operands is valid. If it turns out that there is a Dcache
miss, then the operate is aborted and replayed.


It should be noted that there are two basic reasons for non-issue conditions. The first is a pipeline
stall wherein a valid instruction or set of instructions is prepared to issue but cannot due to a
resource conflict (register conflict or function unit conflict). These types of non-issue cycles can be
minimized through code scheduling. The second type of non-issue condition consists of pipeline
bubbles where there is no valid instruction in the pipeline to issue. Pipeline bubbles result from
the abort conditions described above. In addition, a single pipeline bubble is produced whenever a
branch type instruction is predicted to be taken, including subroutine calls and returns. Pipeline
bubbles are reduced directly by the instruction buffer hardware and through bubble squashing,
but can also be effectively minimized through careful coding practices. Bubble squashing involves
the ability of the first four pipeline stages to advance whenever a bubble or buffer slot is detected
in the pipeline stage immediately ahead of it while the pipeline is otherwise stalled.


2.10 Scheduling and Issuing Rules


2.10.1 Instruction Class Definition and Instruction Slotting


It is important to note that the following scheduling and multiple issue rules are only performance 
related. There are no functional dependencies related to scheduling or multiple issuing. The 
scheduling and issuing rules are defined in terms of instruction classes. The table below specifies 
all of the instruction classes and the pipeline which executes the particular class. With a few 
additional rules, Table 2-8 gives the information necessary to determine the functional resource
conflicts that determine which instructions can issue in a given cycle.


Table 2-8: Instruction Classes and Slotting


Class Name   Pipeline            Instruction List

LD           E0¹ or E1²          all loads except LDx_L
ST           E0                  all stores except STx_C
MBX          E0                  LDx_L, MB, WMB, STx_C, HW_LD-lock, HW_ST-cond, FETCH
RX           E0                  RS, RC
MXPR         E0 or E1 depending  HW_MFPR, HW_MTPR
             on the IPR
IBR          E1                  integer conditional branches
FBR          FA³                 floating point conditional branches
JSR          E1                  jump to subroutine instructions JMP, JSR, RET, or
                                 JSR_COROUTINE, BSR, BR, HW_REI, CALLPAL
IADD         E0 or E1            ADDL ADDL/V ADDQ ADDQ/V SUBL SUBL/V SUBQ
                                 SUBQ/V S4ADDL S4ADDQ S8ADDL S8ADDQ S4SUBL
                                 S4SUBQ S8SUBL S8SUBQ LDA LDAH
ILOG         E0 or E1            AND BIS XOR BIC ORNOT EQV
SHIFT        E0                  SLL SRL SRA EXTQL EXTLL EXTWL EXTBL EXTQH
                                 EXTLH EXTWH MSKQL MSKLL MSKWL MSKBL
                                 MSKQH MSKLH MSKWH INSQL INSLL INSWL
                                 INSBL INSQH INSLH INSWH ZAP ZAPNOT
CMOV         E0 or E1            CMOVEQ CMOVNE CMOVLT CMOVLE CMOVGT
                                 CMOVGE CMOVLBS CMOVLBC
ICMP         E0 or E1            CMPEQ CMPLT CMPLE CMPULT CMPULE CMPBGE
IMULL        E0                  MULL MULL/V
IMULQ        E0                  MULQ MULQ/V
IMULH        E0                  UMULH
FADD         FA                  floating point operates except multiply and CPYS (but
                                 including CPYSN and CPYSE)
FDIV         FA                  floating point divide
FMUL         FM⁴                 floating point multiply
FCPYS        FM or FA            CPYS (but not CPYSN or CPYSE)
MISC         E0                  RPCC, TRAPB
UNOP         none                UNOP

¹ Ebox pipeline 0.
² Ebox pipeline 1.
³ Fbox "add" pipeline.
⁴ Fbox multiply pipeline.


2.10.1.1 Slotting


The slotting function in Ibox determines which instructions will be sent forward to attempt to 
issue. The slotting function detects and removes all static functional resource conflicts. The set 
of instructions output by the slotting function will issue if no register or other dynamic resource 
conflict is detected in stage 3 of the DECchip 21164-AA pipeline. 


The basic slotting algorithm is simple. Starting from the first (lowest addressed) valid instruction
in the INT16 in stage 2 of the DECchip 21164-AA Ibox pipeline, attempt to assign that instruction
to one of the four pipelines (E0, E1, FA, FM). If it is an instruction which can issue in either of
E0 or E1, put it in E0 except that if one of the following is true, put it in E1.


• E0 isn’t free and E1 is free.
• The next integer instruction‡ in this INT16 can only issue in E0.


If the current instruction is one which can issue in either FA or FM, put it in FA unless FA 
isn’t free. Mark the pipeline selected by this process as taken and begin again with the next 
sequential instruction. Stop when an instruction can not be allocated an execution pipeline 
because any pipeline it can use is already taken. The slotting logic also enforces the special rules 
listed below, stopping the slotting process when a rule would be violated by allocating the next 
instruction an execution pipeline. Note that the slotting logic doesn’t send instructions forward 
out of logical instruction order because DECchip 21164-AA always issues instructions in order. 


1. An instruction of class LD can not be simultaneously issued with an instruction of class ST. 


2. All instructions are discarded at the slotting stage after a predicted-taken IBR or FBR class 
instruction, or a JSR class instruction. 


3. After a predicted not-taken IBR or FBR, no other IBR, FBR, or JSR class can be slotted 
together. 

4. The following cases are detected by the slotting logic:
• from lowest address to highest within an INT16, the arrangement
I-instruction, F-instruction, I-instruction, I-instruction,
where I-instruction is any instruction that can issue in one or both of E0 or E1 and
F-instruction is any instruction that can issue in one or both of FA or FM.
• from lowest address to highest within an INT16, the arrangement
F-instruction, I-instruction, I-instruction, I-instruction.
When this type of case is detected, the first two instructions are forwarded to the issue point
in one cycle, and the second two are sent only when the first two have both issued, provided
no other slotting rule would prevent the second two from being slotted in the same cycle.
This makes a code sequence that was optimally scheduled for EV4 perform at least as well
on DECchip 21164-AA.


‡ In this context, an integer instruction is one which can issue in one or both of E0 or E1, not FA or FM.


DEC Chip 21164-AA (EV5 CPU) Specification, Revision 1.9, December 1992
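

The basic slotting algorithm and special rules 1 to 3 can be sketched in software as follows.
This is an illustrative model only, not a description of the Ibox logic. The names PIPES and
slot are hypothetical, the MXPR and UNOP classes and special rule 4 are omitted, and the
fallback to the other pipeline when the preferred one is already taken is an assumption, since
the text does not cover that case.

```python
# Pipelines each class may use (taken from Table 2-8; MXPR and UNOP omitted).
PIPES = {
    "LD": ("E0", "E1"), "ST": ("E0",), "MBX": ("E0",), "RX": ("E0",),
    "IBR": ("E1",), "FBR": ("FA",), "JSR": ("E1",),
    "IADD": ("E0", "E1"), "ILOG": ("E0", "E1"), "SHIFT": ("E0",),
    "CMOV": ("E0", "E1"), "ICMP": ("E0", "E1"),
    "IMULL": ("E0",), "IMULQ": ("E0",), "IMULH": ("E0",),
    "FADD": ("FA",), "FDIV": ("FA",), "FMUL": ("FM",),
    "FCPYS": ("FA", "FM"), "MISC": ("E0",),
}
BRANCH_CLASSES = {"IBR", "FBR", "JSR"}
INTEGER_PIPES = {"E0", "E1"}


def _next_integer_needs_e0(classes, i):
    """True if the next integer instruction after index i can only use E0."""
    for later in classes[i + 1:]:
        pipes = set(PIPES[later])
        if pipes & INTEGER_PIPES:
            return pipes == {"E0"}
    return False


def _preference(classes, i, taken):
    """Pipelines to try, in order, for the instruction at index i."""
    pipes = set(PIPES[classes[i]])
    if pipes == {"E0", "E1"}:
        use_e1 = (("E0" in taken and "E1" not in taken)
                  or _next_integer_needs_e0(classes, i))
        return ("E1", "E0") if use_e1 else ("E0", "E1")
    if pipes == {"FA", "FM"}:
        return ("FA", "FM")            # FA unless FA is not free
    return PIPES[classes[i]]


def slot(classes, predicted_taken=()):
    """Slot one INT16 (lowest address first); return [(class, pipe), ...].

    predicted_taken holds the indexes of branches predicted to be taken.
    """
    taken, slotted, branch_seen = set(), [], False
    for i, cls in enumerate(classes):
        already = {c for c, _ in slotted}
        if (cls == "LD" and "ST" in already) or (cls == "ST" and "LD" in already):
            break                      # rule 1: LD and ST never slot together
        if cls in BRANCH_CLASSES and branch_seen:
            break                      # rule 3: nothing branch-like after a not-taken branch
        pipe = next((p for p in _preference(classes, i, taken) if p not in taken), None)
        if pipe is None:
            break                      # every usable pipeline is already taken
        taken.add(pipe)
        slotted.append((cls, pipe))
        if cls in BRANCH_CLASSES:
            branch_seen = True
            if cls == "JSR" or i in predicted_taken:
                break                  # rule 2: later instructions are discarded
    return slotted


if __name__ == "__main__":
    print(slot(["IADD", "SHIFT", "FADD", "FMUL"]))
    print(slot(["LD", "ST"]))
```

In the first call the IADD is placed in E1 because the next integer instruction, a SHIFT, can
only issue in E0. In the second call only the LD is slotted, because rule 1 stops the process at
the ST.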


2.10.2 Instruction Latencies 


After slotting, instruction issue is governed by the availability of registers for read or write and the 
availability of the floating divide unit and the integer multiply unit. There are producer-consumer 
dependencies, producer-producer dependencies (also known as write after write conflicts) and 
dynamic function unit availability dependencies (integer multiply and floating divide). Ibox logic 
in stage 3 of the DECchip 21164-AA pipeline detects all these conflicts. 


For most instructions the latency to produce a valid result is fixed. The exceptions are loads 
which miss, floating point divides, and integer multiplies. Table 2-9 gives the latencies for each 
instruction class. A latency of 1 means that the result may be used by an instruction issued one 
cycle after the producing instruction. Note that most latencies are a property of the producer only;
except for integer multiply latencies, there are no variations in latency due to which particular
unit produces a given result relative to the particular unit that consumes it. Even in the case
of integer multiply, the instruction is issued at the time determined by the standard latency
numbers, but the multiply’s latency is dependent on which previous instructions produced its
operands and when they executed.


Table 2-9: Instruction Latencies


                                                                        Additional time before
                                                                        result available to
Class    Latency                                                        integer multiply unit†

LD       Dcache hits, latency=2; Dcache miss/Scache hit, latency=7      1 cycle
         or longer§
ST       Stores produce no result                                       -
MBX      LDx_L always Dcache misses, latency depends on memory          -
         subsystem state; STx_C, latency depends on memory subsystem
         state; MB, WMB, and FETCH produce no result
RX       RS, RC, latency=1                                              2 cycles
MXPR     HW_MFPR, latency=1, 2 or longer depending on the IPR;          1 or 2 cycles
         HW_MTPR, produces no result
IBR      produce no result                                              -
FBR      produce no result                                              -
JSR      all but HW_REI, latency=1; HW_REI produces no result           2 cycles
IADD     latency=1‡                                                     2 cycles
ILOG     latency=1‡                                                     2 cycles
SHIFT    latency=1                                                      2 cycles
CMOV     latency=2                                                      1 cycle
ICMP     latency=1                                                      2 cycles
IMULL    latency=8 plus up to 2 cycles of added latency depending on    1 cycle
         the source of the data†; latency until next IMULL, IMULQ, or
         IMULH can issue if there are no data dependencies is 4 cycles
         plus the number of cycles added to the latency.
IMULQ    latency=12 plus up to 2 cycles of added latency depending on   1 cycle
         the source of the data†; latency until next IMULL, IMULQ, or
         IMULH can issue if there are no data dependencies is 8 cycles
         plus the number of cycles added to the latency.
IMULH    latency=14 plus up to 2 cycles of added latency depending on   1 cycle
         the source of the data†; latency until next IMULL, IMULQ, or
         IMULH can issue if there are no data dependencies is 8 cycles
         plus the number of cycles added to the latency.
FADD     latency=4                                                      -
FDIV     data dependent latency is preliminary, 2.4 bits per cycle      -
         average rate; next floating divide can be issued in the same
         cycle that the previous divide’s result is available,
         regardless of data dependencies.
FMUL     latency=4                                                      2
FCPYS    latency=4                                                      -
MISC     RPCC, latency=2; TRAPB produces no result                      1 cycle
UNOP     UNOP produces no result                                        -


† The multiplier is unable to receive data from Ebox bypass paths. The instruction issues at the expected time, but its latency
is increased by the time it takes for the input data to become available to the multiplier. For example, an IMULL issued one
cycle later than an ADDL which produced one of its operands has a latency of 10 (8 + 2). If the IMULL issued two cycles later
than the ADDL, the latency is 9 (8 + 1).

‡ A special bypass provides an effective latency of 0 (zero) cycles for an ICMP or ILOG producing the test operand of an IBR
or CMOV. This is only true when the IBR or CMOV issues in the same cycle as the ICMP or ILOG which produces the test
operand of the IBR or CMOV. In all other cases the effective latency of ICMP and ILOG is 1 cycle.

§ When idle, Scache arbitration predicts a load miss in E0. If a load actually does miss in E0, it is sent to the Scache right
away. If this hits and no other event in the Cbox affects the operation, the requested data is available for bypass in 7 cycles.
Otherwise, the request takes longer, possibly much longer depending on the state of the Scache and Cbox. It should be possible
to schedule some unrolled code loops for Scache using a data access pattern that takes advantage of the Mbox load merging
function, achieving high throughput with large data sets.
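

The added multiply latency described in the multiplier footnote can be worked through in a short
sketch. The function imul_latency and its formula are assumptions: the formula is generalized
from the two ADDL/IMULL examples in the footnote (a gap of one cycle gives 8 + 2, a gap of two
cycles gives 8 + 1) and has not been checked against the hardware for other producer classes.

```python
# (latency, additional cycles before the result reaches the integer multiplier),
# copied from Table 2-9 for a few producer classes.
PRODUCER = {
    "IADD": (1, 2), "ILOG": (1, 2), "SHIFT": (1, 2), "ICMP": (1, 2), "CMOV": (2, 1),
}
IMUL_BASE_LATENCY = {"IMULL": 8, "IMULQ": 12, "IMULH": 14}


def imul_latency(mul_class: str, producer_class: str, issue_gap: int) -> int:
    """Estimate multiply latency when one operand comes from a recent producer.

    issue_gap is the number of cycles between the producer's issue and the
    multiply's issue.
    """
    latency, extra = PRODUCER[producer_class]
    if issue_gap < latency:
        raise ValueError("the multiply cannot issue before the producer's latency has elapsed")
    # Assumed model: the operand reaches the multiplier latency + extra cycles
    # after the producer issues; any shortfall is added to the multiply latency.
    added = max(0, latency + extra - issue_gap)
    return IMUL_BASE_LATENCY[mul_class] + added


if __name__ == "__main__":
    # ADDL belongs to class IADD.
    print(imul_latency("IMULL", "IADD", 1))   # 10, the footnote's first example
    print(imul_latency("IMULL", "IADD", 2))   # 9, the footnote's second example
```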


ISSUE 


The actual issue times of floating divides after floating divides are still open. The above
statement is approximately correct.


2.10.3 Producer-Producer Latency


Producer-producer latencies, also known as write after write conflicts, cause issue-stalls to preserve
write order. If two instructions write the same register, the Ibox forces them to do so in
different cycles. This is necessary to ensure that the correct result is left in the register file after
both instructions have executed. For most instructions, the order in which they write the register
file is dictated by issue order; however, IMUL, FDIV and LD instructions may require more time
than other instructions to complete. Subsequent instructions that write the same destination
register are issue-stalled to preserve write ordering at the register file.


Cases involving an intervening producer-consumer conflict are of interest. They can occur com- 
monly in a multiple-issue situation when a register is re-used. In these cases, producer-consumer 
latencies are equal to or greater than the required producer-producer latency as determined by 
write ordering and therefore dictate the overall latency. 


An example of this case is shown in the code: 


LDQ   R2,D(R0)     ; R2 destination
ADDQ  R2,R3,R4     ; wr-rd conflict stalls execution waiting for R2
LDQ   R2,D(R1)     ; wr-wr conflict may dual issue when addq issues


In general, producer-producer latencies are determined by applying the rule that register file writes
must occur in the correct order (which is enforced by Ibox hardware). Two IADD or ILOG class
instructions that write the same register will issue at least one cycle apart. The same is true
of a pair of CMOV class instructions, even though their latency is 2. For IMUL, FDIV and
LD, a producer-producer conflict with any subsequent instruction results in the second instruction
being issue-stalled until the IMUL, FDIV, or LD is about to complete. The second instruction is
issued as soon as it is guaranteed to write the register file after the IMUL, FDIV, or LD, at least
one cycle afterwards.
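

The two kinds of register conflict discussed in this section can be identified mechanically. The
following sketch classifies the conflicts in the three-instruction example given earlier in the
section. The function name hazards and the tuple layout are hypothetical, and the model is
deliberately simplified: it compares every pair of instructions and ignores intervening
redefinitions of a register.

```python
def hazards(program):
    """Return (earlier_index, later_index, kind) for each register conflict.

    program is a list of (mnemonic, destination, sources) tuples.
    """
    found = []
    for j, (_, dest_j, sources_j) in enumerate(program):
        for i in range(j):
            dest_i = program[i][1]
            if dest_i is None:
                continue                       # instruction i produces no result
            if dest_i in sources_j:
                found.append((i, j, "producer-consumer"))
            if dest_i == dest_j:
                found.append((i, j, "producer-producer"))
    return found


if __name__ == "__main__":
    example = [
        ("LDQ", "R2", ("R0",)),            # LDQ  R2,D(R0)
        ("ADDQ", "R4", ("R2", "R3")),      # ADDQ R2,R3,R4
        ("LDQ", "R2", ("R1",)),            # LDQ  R2,D(R1)
    ]
    for conflict in hazards(example):
        print(conflict)
```

For the example, the sketch reports a producer-consumer conflict between the first load and the
ADDQ, and a producer-producer conflict between the two loads.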


If a load writes a register and within two cycles a subsequent instruction writes the same register,
the subsequent instruction is issued speculatively assuming the load hits. If the load misses, a 
load-miss-and-use trap is generated, causing the second instruction to be replayed by the Ibox. 
When the second instruction again reaches the issue point, it is issue-stalled until the load fill 
occurs. 


2.10.4 DECchip 21164-AA Issue Rules 


The following is a list of conditions that prevent DECchip 21164-AA from issuing an instruction. 


1. No instruction can be issued until all of its source and destination registers are clean, i.e. all
outstanding writes to the destination register are guaranteed to complete in issue order and 
there are no outstanding writes to the source registers or those writes can be bypassed. 
Technically, load-miss-and-use replay traps are an exception to this rule. The consumer of the 
load’s result issues and is aborted because a load was predicted to hit and discovered to miss 
just as the consumer instruction issued. In practice, the only difference is that the latency 
of the consumer may be longer than it would have been had the issue logic known the load 
would miss in time to prevent issue. 

2. An instruction of class LD can not be issued in the second cycle after an instruction of class 
ST is issued. 

3. No LD, ST, LDx_L, MXPR (to an Mbox register), or MBX class instruction can be issued after an
MB instruction has been issued until the MB has been acknowledged on the external pin
bus.

4. No LD, ST, LDx_L, MXPR (to an Mbox register), or MBX class instruction can be issued after a
STx_C (or HW_ST-cond) instruction has been issued until the Mbox writes the success/failure
result of the STx_C (HW_ST-cond) in its destination register.

5. No IMUL instructions can be issued if the integer multiplier is busy. 

6. No floating point divide instructions can be issued if the floating point divider is busy. 


7. No instruction can be issued to pipe E0 exactly two cycles before an integer multiplication
completes.


8. No instruction can be issued to pipe FA exactly TBD cycles before a floating point divide
completes.


9. No instruction can be issued to pipe E0 or E1 exactly two cycles before an integer register fill
is requested (speculatively) by the Cbox, except IMULL, IMULQ, IMULH instructions and
instructions which do not produce a result at all.

10. No LD, ST, LDx_L, or MBX class instruction can be issued to pipe E0 or E1 exactly one cycle
before an integer register fill is requested (speculatively) by the Cbox.

11. No instruction issues after a TRAPB instruction until all previously issued instructions are 
guaranteed to finish without generating a trap other than a machine check. 


Subject to the above rules, all instructions sent to the issue stage (stage 3) by the slotting logic 
(stage 2) are issued. If issue is prevented for a given instruction at the issue stage, all logically 
subsequent instructions at that stage are prevented from issuing automatically. DECchip 21164- 
AA only issues instructions in order. 
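

The in-order constraint can be stated compactly in code. In this sketch, which is illustrative
only, issue_in_order is a hypothetical helper and each entry of blocked says whether one of the
rules above prevents that instruction from issuing in the current cycle.

```python
def issue_in_order(blocked):
    """Return how many instructions at the issue stage issue this cycle.

    blocked[i] is True when some issue rule prevents instruction i from
    issuing. Instructions are listed in logical (program) order.
    """
    issued = 0
    for is_blocked in blocked:
        if is_blocked:
            break          # all logically subsequent instructions are held too
        issued += 1
    return issued


if __name__ == "__main__":
    # The second instruction is blocked, so the third and fourth are held
    # as well even though nothing else prevents them from issuing.
    print(issue_in_order([False, True, False, False]))   # 1
```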


2.11 Revision History


Table 2-10: Revision History


Who              When               Description of change

John Edmondson   9-Feb-1992         Initial release.
John Edmondson   1-May-1992         Update to version 1.5.
John Edmondson   29-November-1992   Update to version 1.8.


Chapter 3


PALcode and IPRs


3.1 Overview 


PALcode is macrocode that runs with privilege 
and interrupts disabled. PALcode has privilege to’ 
such as physical data stream references and Intern 
DECchip 21164-AA, these opcodes are: HW. 
PALmode is the CPU state that distinguis 


Hardware calculates PALcode ent 
EXC_ADDR IPR with a return PC 
read and written using the HW_MFP 
returns instruction flow to the PC 
speed execution by predicting the 


PC<0> is used as the PALmode. 





struction stream mapping disabled, 












tow is begun. _ EXC -ADDR can also be directly 
WV. TPR instructions. The HW_REI instruction 









ie hardware and to PALcode itself. When the CPU 
d this bit remains set as we move through the PAL 
d behaves as if the PC were still longword aligned 


tecture group will provide PALcode to support both the OpenVMS 
e will also provide a DECchip 21164-AA PALcode violation checker 


ifferent types of PALcode entry points: CALL_PAL and traps. 
3.2.1 CALL_PAL










following the CALL_PAL is loaded into EXC_ADDR and is pi 
Stack. 


fall in the shadow of: 


¢ IMUL 
¢ Any Floating Point operate, especially FDIV 


The Microarchitecture chapter describes the | 


Each CALL_PAL instruction includes a functio 
next PC. The PAL OPCDEC flow will be started if € 
• in the range 40(hex) to 7F(hex) inclusive.

• is greater than BF(hex).
• between 00 and 3F(hex) inclusive, AND the current mode is not equal to kernel.


• PC<63:14> = PAL_BASE IPR
• PC<13> = 1
• PC<12> = CALL_PAL
• PC<11:6> = CALL_PA
• PC<5:1> = 0
• PC<0> = 1 - PAL
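
The entry-point calculation can be sketched as follows. The text above is truncated where it names the function-field bits, so this sketch assumes PC<12> takes function bit <7> and PC<11:6> takes function bits <5:0>; the function name and the `kernel_mode` flag are hypothetical.

```python
def call_pal_entry_pc(pal_base, function, kernel_mode):
    """Return the CALL_PAL entry PC, or None when the OPCDEC flow starts.

    Assumption (bit assignments are truncated in the text):
    PC<12> = function<7>, PC<11:6> = function<5:0>.
    """
    illegal = (
        0x40 <= function <= 0x7F                      # 40..7F inclusive
        or function > 0xBF                            # greater than BF
        or (function <= 0x3F and not kernel_mode)     # privileged, not kernel
    )
    if illegal:
        return None
    pc = pal_base & ~0x3FFF                 # PC<63:14> from PAL_BASE
    pc |= 1 << 13                           # PC<13> = 1
    pc |= ((function >> 7) & 1) << 12       # PC<12> (assumed function<7>)
    pc |= (function & 0x3F) << 6            # PC<11:6> (assumed function<5:0>)
    pc |= 1                                 # PC<0> = 1, PALmode; PC<5:1> = 0
    return pc
```

With PAL_BASE = 10000(hex), function 83(hex) would map to 130C1(hex) under these assumptions.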
3.2.2 Traps


PALcode is started up on a subset of the DECchip 21164-AA traps 


pushed with the 
a precise traps. 
Table 3-1 shows the PALcode trap entry points, and their off: ; AL BASE IPR. The 
table lists the entry points from highest to lowest priori ; 

traps works because DTB miss is not asserted when ths 





















Table 3-1: PALcode Trap Entry Points

Entry Name        Offset (hex)   Description
RESET             0000
MCHK              0400
ARITH             0500
INTERRUPT         0100           Hardware, software, and AST
ITBMISS           0180
IACCVIO           0080           Access violation or sign check error on PC
FEN               0580           Floating point Operation attempted with:
                                 IEEE operation with datatype other than S, T or Q
OPCDEC                           Illegal Opcode
DTBMISS_SINGLE                   Dstream TBmiss
DTBMISS_DOUBLE                   Dstream TBmiss during Virtual PTE fetch
UNALIGN                          Dstream unaligned reference
                                 Dstream fault or sign check error on VA


tware executing with ICSR<HWE> set must use extreme care to obey all 
d in this chapter. 
3.3.1 HW_LD


The HW_LD instruction is used by PALcode to do special forms 
and Table 3-2 describe the format and fields of the HW_LD i 
are inhibited for HW_LD instructions. 





















Figure 3-1: HW_LD instruction 


(Instruction format diagram: OPCODE, RA, RB, followed by the PHYS, ALT, WRTCK, QUAD, VPTE and LOCK bits.)


OPCODE 
RA 
RB 
PHYS the HW_LD is virtual. 
dress for the HW_LD is physical. Translation and memory manage- 
are inhibited. 
ALT 0 nt checks use Mbox IPR DTB_CM for access checks. 
ment checks use Mbox IPR ALT_MODE for access checks. 
WRTCK ement checks FOR and read access violations. 


agement checks FOR, FOW, read and write access violations. 


QUAD longword. 
quadword. 
VPTE virtual PTE fetch. Used by trap logic to distinguish single TBmiss from 


ble TBmiss. Access checks are performed in kernel mode. 
d_lock version of HW_LD. PAL must slot to E0 pipe.
Holds a 10-bit signed byte displacement. 
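
The 10-bit signed byte displacement can be interpreted as in the sketch below. The helper names are hypothetical, and the address calculation (RB value plus sign-extended displacement, truncated to 64 bits) is an assumption based on normal Alpha load addressing rather than something stated in this section.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend_disp10(raw):
    """Interpret the low 10 bits of raw as a signed byte displacement."""
    raw &= 0x3FF
    return raw - 0x400 if raw & 0x200 else raw


def hw_ld_address(rb_value, raw_disp):
    """Assumed effective address: RB + sign-extended DISP, modulo 2**64."""
    return (rb_value + sign_extend_disp10(raw_disp)) & MASK64
```

A displacement field of 3F8(hex) therefore denotes -8 bytes, and 1FF(hex) denotes +511 bytes.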
3.3.2 HW_ST


The HW_ST instruction is used by PALcode to do special form 
and Table 3-3 describe the format and fields of the HW_ST in, 
are inhibited for HW_ST instructions. : 


The Ibox logic will always slot HW_ST to pipe E0.


Figure 3-2: HW_ST instruction 





(Instruction format diagram: OPCODE, RA, RB, followed by the PHYS, ALT, MBZ, QUAD, MBZ and COND bits.)


OPCODE 


RA 
RB ; 
PHYS 0 - The address for the HW_ST is virtual.

dress for the HW_ST is physical. Translation and memory manage-

are inhibited.
ALT ent checks use Mbox IPR DTB_CM for access checks.
ment checks use Mbox IPR ALT_MODE for access checks.

QUAD ord. 
COND conditional version of HW_ST. In this case, RA will be written with the value 
DISP Holds a 10-bit signed byte displacement.


.13 and 11 must be zero. 


DIGITAL CONFIDENTIAL PALcode and IPRs 3-5 


DEC Chip 21164-AA (EV5 CPU) Specification, Revision x, October 1993 


3.3.3 HW_REI 













ADDR IPR. The value in EXC_ADDR<0> will be used as the 
HW_REI. 


The Ibox uses the Return Prediction Stack to speed the ex 
different types of HW_REI: 


¢ Prefetch: In this case, the Ibox will begin fetching t! 
is the version of HW_REI that is normally used. 
¢ Stall Prefetch: This encoding of HW_REI inhiki #:$etch until the HW_REI itself is 
issued. Thus, this is the method used to syne. i 
the HW_REI. There is a rule that PALcode: 


ave,ore such HW_REI in an aligned 
block of four instructions. : 


Figure 3-3 and Table 3-4 describe the format and > HW_REI instruction. 


The Ibox logic will slot HW_REI to pipe E1. 





Figure 3-3: HW_REI instruction 
3.3.4 HW_MFPR and HW_MTPR
The HW_MFPR and HW_MTPR instructions are used to access 








via the PC buses. These HW_MFPRs have a latency of one cycl{HW. mer cycle x results in 
: will be moved to 


see Table 3-6. 


Figure 3-4 and Table 3-5 describe the format and fields
instruction. 










Figure 3-4: HW_MFPR, HW_MTPR instruction


Table 3-5: 


OPCODE 


RA/RB t Source register for HW_MTPR. Destination register for HW_MFPR. 
Index ie i 









Index(hex) Ibox slots to Pipe 
Table 3-6 (Cont.): IPR Encodings































Index(hex) 
ITB_IAP 


ITB_IS Ww 107 
SIRR R/W 108 
ASTRR 


ASTER 

EXC_ADDR 

EXC_SUM 

EXC_MASK 

PAL_BASE 

PS 

IPL 

INTID 

IFAULT_VA_FORM 

IVPTBR 

HWINT CLR 

SL_XMIT 

SL_RCV 

ICSR 

IC_FLUSH 

IC_PERR_STAT 

PMCTR 

PALtemp[0:23]            140-157     E1
DTB_ASN                  200         E0
DTB_CM           W       201         E0
DTB_TAG                  202         E0
DTB_PTE          R/W     203         E0
DTB_PTE_TEMP     R       204         E0
MM_STAT          R       205         E0
VA               R       206         E0
VA_FORM          R       207         E0
                 W       208         E0
                 W       209         E0
                 W       20A         E0
                 W       20B         E0
                 W       20C         E0
                 W       20D         E0
Table 3-6 (Cont.): IPR Encodings


IPR               Access   Index(hex)   Pipe
CC_CTL
MCSR              R/W      20F
DC_FLUSH          W        210
DC_PERR_STAT      R/W1C    212
DC_TEST_CTL       R/W      213
DC_TEST_TAG       R/W
DC_TEST_TAG_TEMP  R/W
DC_MODE           R/W
MAF_MODE


3.4 PAL storage registers 


registers. The PAU shadows overlay Ri 
ICSR<SDE> is set. Thus, PALcode ¢:; 
registers cannot be written in the 
ment complete dirty logic on thes 


PALcode disables SDE for the efor error flows. 


e PALtemps are accessed with the HW_MTPR 
tf a PALtemp read to availability is one cycle. 





Access Internal Storage 


PALtemp /Ibox—PS /Mbox-DTB_CM 
/ Interrupt logic-IPL/ PALshadow 





PC - Ibox

ASTEN R/W Interrupt logic-ASTER
ASTSR R/W Interrupt logic-ASTRR
IPIR W -
Table 3-7 (Cont.):
Register Name 


OpenVMS SRM defined State 


Mnemonic 






Interrupt Priority Level IPL R/W 
Machine Check Error Summary MCES R/W 
Privileged Context Block Base PCBB R 
Processor Base Register PRBR 

Page Table Base Register PTBR 

System Control Block Base SCBB 


SW Interrupt Request Register SIRR 
SW Interrupt Summary Register SISR 


TB Check TBCHK 
TB Invalidate All TBIA 
TB Invalidate All Process TBIAP 
TB Invalidate All Dstream TBIAD 


TB Invalidate All Istream 
TB Invalidate Single 
Kernel Stack Pointer 
Executive Stack Pointer 
Supervisor Stack Pointer 
User Stack Pointer 
Virtual Page Table Base 


Who Am I 
Floating Point Enable 
Address Space Number 
Cycle Counter 
Unique 
lock_flag 
PALtemp


PALtemp / Ibox-IVPTBR / Mbox-MVPTBR


PALtemp

Ibox-ICSR

Ibox-ITB_ASN / Mbox-DTB_ASN
Mbox-CC, CC_CTL. Read with RPCC
PCB


Cbox/System. Access with LDx_L
and STx_C, and HW_LD and HW_ST
variants.
IPR Name
Processor Status 


Program Counter 
Interrupt Entry Address 
Arith Trap Entry Address 
MM Fault Entry Address 
Unaligned Access Entry Address 
Instruction Fault Address 

Call System Entry Address 
User Stack Pointer 

Kernel Stack Pointer 

Kernel Global Pointer 

System Value 

Page Table Base Register 
Virtual Page Table Base 

Process Control Block Base 
Address Space Number 
Cycle Counter 

Floating Point Enable 
lock_flag 


Unique 
Who Am I 


Table 3-8: OSF SRM defined State 










Mnemonic Access 


PS 



























PS /Mbox—DTB_CM 
1c-IPL/ PALshadow 


PC 
entINT 
entARITH 
entMM 
entUNA 
entIF 


Ibox-IVPTBR / Mbox-MVPTBR
PALtemp

Ibox-ITB_ASN / Mbox-DTB_ASN
Mbox-CC, CC_CTL. Read with RPCC.
Ibox-ICSR


Cbox/System. Access with LDx_L
and STx_C, and HW_LD and HW_ST
variants.
R/W PCB 
R PALtemp 


ox support distinct trap entry points for single and double TBmiss. 


• The design of the interrupt hardware is specifically tailored to speed up OpenVMS CALL_PALs
like MTPR_IPL.
• The PALtemps have a 1 cycle latency.
• The more frequent PAL trap entry points are grouped together to im


3.8 TBmiss flows 


Figure 3-5: Istream TBmiss flow 



























Assumptions, info, etc. 
This is the entry for Istream TBmiss. A virt 

be attempted. If the virtual PTE fetch 

be taken to the double miss routine, wh 

the PTE fetch and HW_REI back to this ro 
Instruction pairs show E0/E1.
Best case timing: 16 cycles (6 in, 8 execute, 


ITBMISS:
nop
mfpr r8, ev5$_ifault_va_form ; Ge PTE.
nop
mfpr r10, exc_addr ; nstruction.

ld_vpte r8, 0(r8) ; traps to DTBMISS_DOUBLE in case of TBmiss

mtpr r10, exc_addr ; exc_address if there was a trap.
mfpr r31, ev5$_va ; in case there was a double miss
nop

and r8, #ptesm_foe, r25
blbc r8, INVALID ; not valid.

nop
bne r25, INVALID

nop
mtpr r8, ev5$_itb_pte ; Ibox remembers the VA, load the PTE into the ITB.

hw_rei_stall ; me, synch and return.
Figure 3-6: Dstream TBmiss flow





Assumptions, info, etc. 

This is the entry for Dstream TBmiss (from native or PALmode)

A virtual fetch of the PTE will be attempted. If the vir 

PTE fetch TBmisses, a trap will be taken to the double : 

which will fill the TB for the PTE fetch and HW_REI back 
Instruction pairs show E0/E1.


Best case timing: 18 cycles (8 trap shadow, 9 execute, 1 out) 


DTBMISS_SINGLE:
mfpr r8, ev5$_va_form ; Get virtual address
mfpr r10, exc_addr ; Get PC of faulting
mfpr r9, ev5$_mm_stat ; Get read/write bit.
mtpr r10, pt6 ; Stash exc add

ld_vpte r8, 0(r8) ; Get PTE, trap in case of TBmiss

nop ; Pad MFPR VA
mfpr r10, ev5$_va ; Get origi load.
nop
mtpr r8, ev5$_dtb_pte ; Write DTB PTE
blbc r8, INVALID_DPTE_HANDLER ; Handle inval

mtpr r10, ev5$_dtb_tag ; Write DTB TAG part, completes DTB load. No virt ref for 3 cycles.
mfpr r10, pt6 ; ns take 2 cycles

mtpr r10, exc_addr ; e we trapped.
mfpr r31, pt0

hw_rei


3.9 IPRs 


This section describes, on
Registers. Ibox, Mbox, ¢
HW_MFPR instructions.
accessible in the physité
the Cbox, Scache, ar


asis, all the DECchip 21164-AA Internal Processor
e IPRs are accessible to PALcode via the HW_MTPR and
lists the IPR numbers. Cbox, Scache, and Bcache IPRs are
ion FFFFF00000 to FFFFFFFFFF. Table 3-30 summarizes
ble 3-43 lists restrictions on the IPRs.


3.9.1 


NOTE 


tated, IPRS are not cleared or set by hardware on chip or on timeout 
3.9.1.1 ITB_TAG


The ITB_TAG register is a write only register. This register is wr ware on an 


of the ITB, 


into the tag: field 
of the ITB location, which is determined by a NLU algorithm: is obtained from 


the MTPR ITB_PTE instruction. 


Figure 3-7: Istream TB Tag, ITB_TAG 





3.9.1. 


increments the NLU pointer, which 
The TAG field of the ITB location 
ction. Writes to this register use the memory 
y management chapter of the Alpha SRM. 





12 11 10 09 08 07 06 05 04 03 00 


$a-------- es pee tant inten tan te ct ene pentane nant 
| | e lu |S JE {KIT | \A | | 
| IGN | IGN IR |R IR [R {|G {| GH |S |} IGN | 
I | F IE JE |B JE IN | IM | | 

tone eran ee pont a pete te ntecn na ten tenn at 


two instructions. A read of the ITB_PTE register returns the
pointer to the ITB_PTE_TEMP register and updates the NLU pointer
algorithm. A zero value is returned to the integer register file. A
B_PTE_TEMP register returns the PTE to the general purpose integer


according ¢ 
second re 


is bumped in trap shadows. 
Figure 3-9: Istream TB PTE Read Format, ITB_PTE








63 59 58 32 31 30 29 28 22 21 20 19 18. 








the current process. 


Figure 3-10: Address Space Number Read. 





3.9.1.4 ITB_PTE_TEMP 


The ITB_PTE_TEMP register is a 
ITB_PTE register returns data {6 
returns data to the integer gi 


Figure 3-11: Istream T






63 59 58 22 28 22 21 20 19 18 13 12 0 

tecce nena +o --~ gee. ~~ ae ----4-------- 4+---------- fae tententeintens stanton nae + 

| | | IU |S [BE IK | JA | | 

| RAZ | 20> | RAZ IR JR {R JR |RAZ ([S | RAZ | 
| 


1E |E |E |E | IM | 


Description 
1 RO Is set if GH(granularity hint) equals 11. 


RO Is set if GH(granularity hint) equals 10 or 11. 
RO Is set if GH(granularity hint) equals 01, 10 or 11. 
3.9.1.5 Istream TB Invalidate All Process, ITB_IAP


This is a write-only register. Any write to this register invalidates all ITB entries whose ASM
bit equals zero.


3.9.1.6 Istream TB Invalidate All, ITB_IA


in order to initialize the NLU pointer. 


3.9.1.7 ITB_IS 


This is a write-only register. Writing a virtual ad 
meets any one of the following criteria: 


¢ An ITB entry whose VA field matches ITE: 
ASN<10:4>. 


¢ An ITB entry whose VA field matches ITB_IS<42: 


nvalidates the ITB entry that 


“whose ASN field matches ITB_ 


whose ASM bit is set. 


Figure 3-12: ITB_IS 





This is a read-only registe 


tain s the formatted faulting virtual address on an ITBMiss/IAC| 
IACCVIO’s generated ‘ 


rrors). The formatted faulting address generated depends 
‘enabled through the SPE <0> bit of the ICSR. 








03 02 00 
Figure 3-14: IFAULT_VA_FORM in NT mode





3.9.1.9 Virtual Page Table Base register, IVPTBR 


This is a read-write register. 











Figure 3-15: IVPTBR in non NT mode 


RAZ/IGN 


3.9.1.10 Icache Parity Error 


This is read/write register t : 
bits may be cleared by writi the appropriate bits. 


Figure 3-17: 


Extent Type Description 





       11   W1C   Data parity error occurred.
TPE    12   W1C   Tag parity error occurred.
       13   W1C   Timeout reset error or CFAIL_H/no CACK_H error occurred.
3.9.1.11 Cache Flush Control register, IC_FLUSH_CTL
This is a write-only register. Writing any value to this register flushes the Icache.


3.9.1.12 Exception Address register, EXC_ADDR


ADDR register. This register can be written both by h 
happen as a result of exceptions/interrupts and CALLP; 
occur as a result of exceptions/interrupts take precedenc 


re. Hardware writes 
Hardware writes which 


In case of an exception/interrupt, hardware writes a PC to this register in S6 of the execution
pipeline. In case of precise exceptions, this is the PC of the instruction that caused the exception.
In case of imprecise exceptions/interrupts, this is# 
issued if the exception/interrupt was not reporté 


In case of a CALLPAL instruction, the PC of the instruction after the CALLPAL is written to
EXC_ADDR in S5. Software writes of the register via the HW_MTPR instruction also take
place in S5. At a given time only a CALLPAL or HW_MTPR instruction will attempt to write
EXC_ADDR as both these instructions are slotted to the E1 pipe.


Bit <0> of this register is used to indicate PALmode. On a HW_REI the mode of the machine
is determined by bit <0> of the EXC_ADDR register.


Figure 3-18: EXC_ADDR Read/Write format


The exception summary register records the different arithmetic traps that have occurred since the
last time EXC_SUM was written. Any write to this register clears bits <16:10>.






IT {I JU |F ID IT {s } 
> RAZ/IGN JO [IN IN {0 |Z IN |W { RAZ/IGN | 
IV JE JF IV JE Iv Ic | | 
Table 3-11: EXC_SUM Field Descriptions
Extent Type 





Name 





Description 


instructions that trappedsinee 
contained the /S modifier. The ‘ 








e /S modifier completed 
mains cleared regardless 


of instructions that have caused an arithmetic 
e destination is recorded as a single bit mask 
A write to EXC_SUM clears the EXC_MASK 


: : Healt register which contains the base address for PALcode. 
d by hardware on reset. 
Figure 3-21: PAL_BASE





3.9.1.16 Processor Status, PS 


The processor_status register is a read/write register containing the current mode bits of the
architecturally defined PS.


Figure 3-22: Processor Status, PS 





RAZ/IGN 


3.9.1.17 Ibox Control/Status Register, ICSR


This is a read-write register which contains Ibox-related control and status information.









Figure 3-23: Ibox Control/Status 


(Register layout diagram: the ICSR fields and their bit positions are listed in Table 3-12.)


Table 3-12: ICSR 





Description 


RW,0 If set, the timeout counter counts 5K cycles before asserting
timeout reset. If clear, the timeout counter counts 1 billion
cycles before asserting timeout reset.


RW,0 If set, disables the Ibox timeout counter. Does not affect
CFAIL/no CACK error.


RW,0 If set, floating point instructions may be issued. When clear,
floating point instructions cause FEN exceptions.


RW,0 If set, allows PALRES instructions to be issued in kernel
mode.
Table 3-12 (Cont.): ICSR Field Descriptions


Description 


SPE 29:28  RW,0 IfSPE<1> is set, it enables ‘Saapping of istream vir- 


tual addresses VA<39: : cal address PA<39:13>, 






















if VA<42:41> = 10. Vi ress biE=:VA<40> is ie in 
this translation. Access 183i6w: 
SPE<0> when 
virtual addresse 
address PA<39: 


e mapping of istream 
‘E (Hex) directly to physical 





to PA<30:13>. Acc ed only in kernel mode. 
SDE 30 RW,0 If set, enables PAL shadow registers.
CRDE 32 RW,0 | or interrupts 
SLE 33 RW,0 ae ee interrupts. 
FMS 34 RW,0 : if he references. 
FBT 35 RW,0 ne 
FBD 36 RW,0 If set, forces bad Icache data parity.
DBS 37 RW,1 This bit controls selection of the multiplexer for the debug


ug port sees bits <11:4> of the siloed PC. 
acket from the MBOX is selected. 


ISTA 38 it indicates ICACHE BIST status. If set, 
IST was successful 
TST 39 #1 to this bit causes the TEST_STATUS_H pin of the 


asserted. 


3.9.1.18 Interrupt Priority L 


ining he value of the architecturally specified IPL register. 
Tupt whose target IPL level is greater than the value in 


id Register, INTID 


only register. It is written by hardware with the target IPL of the highest priority 
‘upt. The hardware recognizes an interrupt if this IPL is greater than the IPL 
given by IP >. Interrupt service routines may use the value of this register to determine 
the cause of the interrupt. PALcode for the interrupt service must ensure that the IPL level
in INTID is greater than the IPL level specified by the IPL register. This restriction is required
because a level sensitive hardware interrupt may disappear before the interrupt service routine
is entered (passive release).
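
The passive release check described above amounts to a single comparison at the top of the interrupt flow. The following is a minimal sketch with hypothetical function names and return strings; it models only the comparison, not the actual PALcode interrupt flow.

```python
def interrupt_still_pending(intid, ipl):
    """True if the IPL reported in INTID is above the current IPL."""
    return intid > ipl


def service_interrupt(intid, ipl):
    """Decide whether to service or dismiss (hypothetical return values)."""
    if not interrupt_still_pending(intid, ipl):
        # The level-sensitive request went away before the flow was
        # entered (passive release); there is nothing to service.
        return "dismiss"
    return "service"
```

For instance, with the IPL register at 20, an INTID value of 21 is serviced, while an INTID value of 20 or lower is dismissed.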


strupt does not 


The contents of INTID are not correct on a HALT interrupt, as ine 
s INTID indicates 


have a target IPL at which it can be masked. When a HALT 4 
the next highest priority pending interrupt. PAL code for inté 
to determine if a HALT interrupt has occurred.


Figure 3-25: Interrupt Id Register, INTID 





RAZ/IGN 


The Asynchronous System Trap Request Register is a read/write register which contains bits to
request AST interrupts in each of the four modes (USEK). In order to generate an AST
st be set and the current processor mode 
» associated with the AST request. 





03 02 01 00 
=a --------------- tocteatentont 
|U |S JE IK | 
|A |A JA |A | 
IR JR IR IR | 
wert cn een nee + + RW --- --- -------- -------~+------- tonto t--t—-t 







3.9.1.21 Asynchronous System Trap Enable Register, ASTER


Register is a read/write register which contains bits to enable 


03 02 01 00 

Re en ee ee ee en en enn fa mtententent 
|U |S {E JK 

RAZ/IGN IA |A JA JA | 
JE |E JE JE 

------- ------ ---- ------ --- -- -- -- -- -- 5 = facta ta-tant 
3.9.1.22 Software Interrupt Request Register, SIRR


The Software Interrupt Request Register is a read/write register used to control software interrupt
requests. A software request for a particular IPL may be requested by setting the appropriate bit in
SIRR<15:1>.


Extent Type Description


SIRR 18:4 RW Request software interrupts.
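
Because the 15-bit field SIRR<15:1> occupies register bits 18:4, a request for software level n lands in register bit n + 3. The sketch below builds the value to write for a set of levels; the function name is hypothetical, and the linear level-to-bit mapping is an inference from the field extent rather than something stated explicitly.

```python
def sirr_request_mask(levels):
    """Build a SIRR write value requesting the given software levels.

    Assumption: SIRR<n> (n = 1..15) sits at register bit n + 3, which
    follows from the field SIRR<15:1> occupying register bits 18:4.
    """
    mask = 0
    for level in levels:
        if not 1 <= level <= 15:
            raise ValueError("software interrupt level must be 1..15")
        mask |= 1 << (level + 3)
    return mask
```

Requesting level 1 alone gives 10(hex), and requesting level 15 alone gives 40000(hex).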


3.9.1.23 HW Interrupt Clear register, 


This is a write-only register, used {0 is sitive hardware interrupt requests. 


Figure 3—29: 


Table 3-14: 


Description 





Clears perf counter 0 interrupt requests.
W1C Clears perf counter 1 interrupt requests.
W1C Clears perf counter 2 interrupt requests.
W1C Clears correctable read data interrupt requests.


W1C Clears serial line interrupt requests.
3.9.1.24 Interrupt Summary register, ISR


The Interrupt Summary register is a read-only register which contains information about all pending
hardware/software/AST interrupt requests.


Figure 3-30: Interrupt Summary Register, ISR read format 








(Register layout diagram, high bits to low: RAZ, HLT, SLI, CRD, MCK, PFL, PC2, PC1, PC0, RAZ, I23, I22, I21, I20, ATR, SIRR, ASTRR.)








Table 3-15: ISR read format Field Descriptigj 


ASTRR[3:0]
SIRR[15:1]   18:4   RO,0
ATR          19
I20          20            External hardware interrupt at IPL 20.
I21          21            External hardware interrupt at IPL 21.
I22          22            External hardware interrupt at IPL 22.
I23          23            External hardware interrupt at IPL 23.
PC0                        External hardware interrupt - Performance counter 0 (IPL 29).
PC1                        External hardware interrupt - Performance counter 1 (IPL 29).
PC2                        External hardware interrupt - Performance counter 2 (IPL 29).
PFL                        External hardware interrupt - Powerfail (IPL 30).
MCK                        External hardware interrupt - system machine check (IPL 31).
CRD                 RO     Correctable ECC errors (IPL 31).
SLI                 RO     Serial line interrupt.
                           External hardware interrupt - halt.
3.9.1.25 Serial line transmit, SL_XMIT


The serial line transmit register is a write-only register used to transmit bit-serial data off chip
under the control of a software timing loop. The value of the TM s trans#tted off chip on 
the SROM_CLK_H pin. In normal operation mode (not in debug gress 


overloaded and serves both the serial line transmission and the I¢é interface. 


Figure 3-31: Serial line transmit Register, SL_XMIT 





ive bit-serial data under the control 
e SL_RCV 4 gister is functionally connected to the 
|{whenever a transition is detected on the 

ying normal operations (not in test-mode), 
erial line reception and the [Cache serial 


The serial line receive register is a read-only register used to 
of a software timing loop. The RCV bit ing 















ROM interface. 


Figure 3-32: 


07 06 05 
$o-----~--------------- aa - —- - ee -- -- -- -- pec toace cena eee n--H + 
IR | | 
Ic | | 
lvl 


3.9.1.27  Performané 


The Performance 


mance counters 
Register). 


13 
—------ Se 
R1<15:0>|SELO|Ku|CTR2<13:0>|CTLO|CTL1|CTL2| Kp|Kk {| SEL1|SEL2 | 
— foe maton tenn nn nn penn ton nn tenet enna pene tennnt 
Table 3-16: PMCTR Field Descriptions
Extent Type 




















Description 


CTR0[15:0] 63:48 RW 16 bit counter
CTR1[15:0] 47:32 RW 16 bit counter
CTR2[13:0] 29:16 RW 14 bit counter


CTL0[1:0] 15:14 RW,0 CTR0 counter control:
00 counter disable, interrupt disable


q 65536 (16384) 

CTL1[1:0] 13:12. RW,0 
pt at freq 65536 (16384) 

CTL2[1:0] 11:10 RW,0


errupt disable
1 counter enable, interrupt disable
counter enable, interrupt at freq 65536 (16384)


SEL0        31    Select - see Table 3-17
SEL1[3:0]         Select - see Table 3-17
SEL2[3:0]   3:0   Select - see Table 3-17
Ku          30    Disables measurements in user mode
Kp          9     Disables measurements in PAL mode
Kk          8     Kernel, Executive, Supervisor mode. Disables measurements
                  in Kernel, Executive, and Supervisor modes.
                  Ku=1,Kp=1,Kk=1 allows measurements in Exec/Supervisor
                  modes only
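
The field extents in Table 3-16 can be turned into a small decoder, which is convenient when inspecting a PMCTR value read with HW_MFPR. This is a sketch only: the helper names are hypothetical, and the SEL1 extent is not legible in the table, so bits 7:4 are assumed because they are the only bits left between Kk (bit 8) and SEL2 (bits 3:0).

```python
def bits(value, hi, lo):
    """Extract value<hi:lo> as an unsigned integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pmctr(value):
    """Split a 64-bit PMCTR value into its fields (SEL1 extent assumed)."""
    return {
        "CTR0": bits(value, 63, 48),
        "CTR1": bits(value, 47, 32),
        "SEL0": bits(value, 31, 31),
        "Ku": bits(value, 30, 30),
        "CTR2": bits(value, 29, 16),
        "CTL0": bits(value, 15, 14),
        "CTL1": bits(value, 13, 12),
        "CTL2": bits(value, 11, 10),
        "Kp": bits(value, 9, 9),
        "Kk": bits(value, 8, 8),
        "SEL1": bits(value, 7, 4),   # assumed extent, see lead-in
        "SEL2": bits(value, 3, 0),
    }
```

The decoder does not interpret the select codes themselves; those are listed in Table 3-17.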


Table 3-17: 


Counter0 
Select0[0] 











Counter2 
Select2[3:0] 


0x0: Long(>15 cycle) Stalls 


0x1: reserved










5:Dual-Issue cycles 
0x6:Triple-Issue cycles 


:Quad-Issue cycles 
Table 3-17 (Cont.): PMCTR Counter Select Options


Counter0 Counter1 Counter2 
Select0[0] Select1[3:0] Select2[3:0] 

























1:Instructions Ox8:jsr-ret 0x2:PC-Mispre} 
0x8:Cond-Branch 


0x8:All Flow-change instructions 
if sel2=!(PC-M or BR-M) 


0x9:IntOps
0xA:FPOps
0xB:Loads
0xC:Stores
0xD:Icache
0xE:Dcache Accesses


iDxL Instructions 
Pick CBOX input 2 










OxF:Pick CBOX input: QxF: 





3.9.2 Mbox and Dcache IPRs


NOTE 


Traps are fa i IPR write operations unless noted otherwise. 
Unless explicitly 
reset. 


3.9.2.1 DTB 


The DTB.. 
with a : 
Figure 3-34: DTB_ASN


3.9.2.2 DTB_CM, Dstream TB Current Mode 


The DTB_CM register is a write-only register which, whe 
an exact duplicate of the Ibox Processor Status (IPS) ee : 
Current Mode of the machine. 


de, must be written with 
field. These bits indicate the 


Figure 3-35: DTB_CM 


2 
Se or oe 









3.9.2.3 DTB_TAG, 
The DTB_TAG register is a write-only register which writes the DTB tag and the contents of


m Paces in hardware. A write to the DTB_TAG register increments
the DTB which allows writing the entire set of DTB PTE and TAG entries.
Figure 3-36: DTB_TAG, Dstream TB Tag





3.9.2.4 Dstream TB PTE, DTB_PTE 


The DTB_PTE register is a read/write register represent: 
The entry to be written is chosen by a not-last-used algori 
to the DTB_PTE use the memory format bit positi® 
exception that some fields are ignored. In partic 


To ensure the integrity of the DTB, the PTE is not transferred to the DTB until the DTB_TAG
register is written. As a result, writing the DTB_PTE and then reading without an intervening
DTB_TAG write will not return the data previously written to the DTB_PTE register.
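
The two-step DTB load (PTE first, then tag) can be modelled as a staged write that is committed by the tag write. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the entry count of 64 is taken from the DTBIA description, and the not-last-used replacement algorithm is simplified to a pointer that advances on each tag write.

```python
class DtbModel:
    """Toy model of the DTB_PTE / DTB_TAG write protocol."""

    ENTRIES = 64  # taken from the DTBIA description

    def __init__(self):
        self.entries = [None] * self.ENTRIES  # each entry: (tag, pte)
        self.pending_pte = None               # staged by a DTB_PTE write
        self.pointer = 0                      # simplified NLU pointer

    def write_pte(self, pte):
        # The PTE is only staged; the DTB itself is not modified yet.
        self.pending_pte = pte

    def write_tag(self, tag):
        # The tag write commits tag and staged PTE together, then
        # advances the entry pointer.
        self.entries[self.pointer] = (tag, self.pending_pte)
        self.pointer = (self.pointer + 1) % self.ENTRIES

    def lookup(self, tag):
        for entry in self.entries:
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None
```

In this model a lookup issued between `write_pte` and `write_tag` still misses, which mirrors the rule that the PTE does not reach the DTB until the tag is written.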


Reads of the DTB_PTE require two instructions. First, a read from the DTB_PTE sends the PTE


a 


data to the DTB_PTE_TEMP register. 
a DTB_PTE read. A second instructi 
PTE entry to the register file. Rea 
of the DTB which allows reading 







11 10 09 08 07 06 05 04 03 O2 O01 00 
tom peepee ton ton penton teste ntente ston testa t 


Pile code hee i dt a aay ed Co | 


fa epanto nti n ten ten ten tectentententectent 










returned to the integer register file on
ne DTB_PTE_TEMP register returns the
register increments the TB entry pointer

























ignore 
FOR 
FOW 
ignore 
ASM 
GH<1:0> 
ignore 
KRE 
ERE 

SRE 
URE 

KWE 
EWE 

SWE 

UWE 
PFN<39:13> 
3.9.2.5 DTB_PTE_TEMP


63 39 13 12 10 09 08/07 06 05 04108 











UWE 
PFN<39, .13> 










When D-stream faults or 
and saved in the MM_ST, 
against further updates ; 
by hardware when the regis: 
Dcache parity error i


{ware reads the VA register. MM_STAT bits are only modified 
ot locked and a memory management error, DTB miss, or 


Figure 3-39: 
Table 3-19: MM_STAT Field Descriptions


Extent Type 






















Description 


WR 0 RO ats Ss 

ACV 1 RO i in. Includes bad VA. 
FOR 2 RO Set if reference was a read and the PTE's FOR bit was set.
FOW 3 RO Set if reference was a write and the PTE's FOW bit was set.
DTB_MISS 4 RO Set if reference resulted in a DTB miss.

BAD_VA 5 RO Set if reference had a bad virtual address.

RA 10:6 RO : 


OPCODE 
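
The MM_STAT extents listed above map directly onto a decoder. The sketch below is illustrative only: the function name is hypothetical, and the OPCODE field is left out because its extent is not legible in the table.

```python
def decode_mm_stat(value):
    """Decode the MM_STAT fields whose extents appear in Table 3-19.

    The OPCODE field is omitted because its bit extent is not legible
    in the table.
    """
    return {
        "WR": bool(value & (1 << 0)),
        "ACV": bool(value & (1 << 1)),
        "FOR": bool(value & (1 << 2)),
        "FOW": bool(value & (1 << 3)),
        "DTB_MISS": bool(value & (1 << 4)),
        "BAD_VA": bool(value & (1 << 5)),
        "RA": (value >> 6) & 0x1F,   # bits 10:6
    }
```

A value with bits 0 and 3 set and 17 in bits 10:6 decodes as a write that hit a fault-on-write PTE, with RA = 17.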


3.9.2.7 VA, Faulting Virtual Address 


When D-stream faults, DTB misses, or Dcache parity errors occur, the effective virtual address
associated with the fault, miss, or error is latched in the read-only VA register. The VA, VA_FORM,
and MM_STAT registers are locked against further updates until software reads the VA
register. The VA IPR is not unlocked orf: 


Figure 3-40: VA, Faulting VA Reg 


VA and the Virtua 


enhancement ti PBiniss PALflow. The VA is formatted as a 32-bit PTE when the 


NT_Mode bit oR t. VA_FORM is a one IPR, and is locked on any D-stream 
fault, DTB ; 


against fur: 







1 describes VA_FORM when MCSR<SP0> is clear. Figure 3-42 describes
R<SP0> is set.
Figure 3-41: VA_FORM, Formatted VA Register for NT_Mode=0





Table 3-20: VA_FORM Field Descriptions 





VA<42:13> riginal faulting Virtual Address, NT_Mode=0. 


VPTB 63:33 age Table Base address as stored in MVPTBR,NT_ 
VA<31:13> 21:03 the original faulting Virtual Address, NT_Mode=1. 
VPTB Page Table Base address as stored in MVPTBR,NT_ 





VA register, the
or Dcache parity
3.9.2.10 DC_PERR_STAT, Dcache Parity Error Status


register. The VA, VA_FORM and MM_STAT registers are lock
software reads the VA ae If a Dcache parity error is deteg$


writes a "one" to clear the LOCK bit. The SEO bit is set when a D
the Dcache parity error status register is locked. Once, the SEO bit
further updates until the software writes a "one" to DC. :

bit. Note the SEO bit does not get set when Dcache


Set it is locked against 
to unlock and clear the 


cleared on reset. 







Figure 3-44: DC_PERR_STAT, Dcache Parity: 


06 05 04 03 02 01 00 
peer anon en n= 2 5 === enna nnn nnn penton toate nten tet 











Table 3-21: DC_PERR_STi 













Set if second Dcache parity error occurred in a cycle after the
register was locked. The SEO bit will not be set as a result
of a second parity error that occurs within the same cycle as
the first.


LOCK Set if parity error detected in Dcache. Bits <5:2> are locked
against further updates when this bit is set. Bits <5:2> are
cleared when the LOCK bit is cleared.

DP0 Set on data parity error in Dcache bank 0.

DP1 Set on data parity error in Dcache bank 1.

TP0 Set on tag parity error in Dcache bank 0.


Set on tag parity error in Dcache bank 1.
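
The LOCK and SEO behaviour described in the table can be modelled as below. This is a sketch under assumptions: the bit positions of SEO (bit 0) and LOCK (bit 1), and the order DP0, DP1, TP0, TP1 within bits <5:2>, are not stated in the legible text and are assumed here; the function names are hypothetical. Each call to `report_error` stands for one cycle, so several status bits passed together model errors in the same cycle.

```python
# Assumed bit positions: SEO=0, LOCK=1, DP0=2, DP1=3, TP0=4, TP1=5.
SEO, LOCK, DP0, DP1, TP0, TP1 = (1 << n for n in range(6))
STATUS_BITS = DP0 | DP1 | TP0 | TP1  # bits <5:2>


def report_error(reg, error_bits):
    """Apply the parity errors detected in one cycle to the register."""
    if reg & LOCK:
        # Already locked: status bits are frozen, only SEO is recorded.
        return reg | SEO
    # First error cycle: latch the status bits and set LOCK. Errors in
    # this same cycle do not set SEO.
    return reg | LOCK | (error_bits & STATUS_BITS)


def write_one_to_clear(reg, written):
    """Model software writing ones to the LOCK and/or SEO bits."""
    if written & LOCK:
        reg &= ~(LOCK | STATUS_BITS)  # clearing LOCK also clears <5:2>
    if written & SEO:
        reg &= ~SEO
    return reg
```

Under these assumptions, an error in a later cycle leaves the latched status bits untouched and only sets SEO.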
3.9.2.11 Dstream TB Invalidate All Process, DTBIAP


This is a write-only register. Any write to this register invalidates all DTB entries in which the
ASM bit is equal to zero.


3.9.2.12 Dstream TB Invalidate All, DTBIA


This is a write-only register. Any write to this register invalidates all 64 DTB entries, and resets
the DTB NLU pointer to its initial state.


3.9.2.13 DTBIS, Dstream TB Invalidate Single 


that meets any one of the following criteria: 


¢ A DTB entry whose VA field matches DT 
ASN<63:57>. 


¢ A DTB entry whose VA field matches DTBIS<# whose ASM bit is set. 
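
The two matching criteria above can be sketched as a single predicate: an entry is invalidated when its VA field matches and either its ASN matches the current ASN or its ASM bit is set. The dictionary layout and function name are hypothetical, and the VA comparison is simplified to an equality test on an already-extracted page number.

```python
def invalidate_single(entries, va_page, current_asn):
    """Invalidate matching entries in place; return how many were hit.

    entries: list of dicts with keys "va", "asn", "asm", "valid"
    (a hypothetical representation of DTB entries).
    """
    hits = 0
    for entry in entries:
        if not entry["valid"] or entry["va"] != va_page:
            continue
        # Criterion 1: ASN matches. Criterion 2: ASM bit is set.
        if entry["asn"] == current_asn or entry["asm"]:
            entry["valid"] = False
            hits += 1
    return hits
```

An entry with a matching VA but a different ASN and a clear ASM bit is left valid.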


The DTBIS is writt ie 
operation will be aborted. lie IBOX only for the following trap conditions: ITB miss, 
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Figure 3-46: MCSR, Mbox Control Register 






















Table 3-22: MCSR Field Descriptions 
Name Extent Type 


M_BIG_ENDIAN 








nable. When set, bit 2 of the physical 


SP<1:0> 2:1 RW,0 


> 


is PA<29:13>, with bits <39:30> of physical address 
. SP<0> is the NT_Mode bit that is used to control 
VAtermatting on a read from the VA_FORM IPR. Superpage 
ccess is only allowed in kernel mode. 


Debug Test Select. The DBG_TEST_SEL<1:0> bits are used
to control the Mbox/Cbox DECchip 21164-AA parallel test
port mux selection. When DBG_TEST_SEL<1:0> = (00), the
Cbox pgc_pata<7:0> is selected. When DBG_TEST_SEL<1:0>
= (01), the Mbox DCI debug packet is selected. When
DBG_TEST_SEL<1:0> = (10), the Mbox MAF_OUT debug packet
is selected. When DBG_TEST_SEL<1:0> = (11), the debug
packet selection is dynamically controlled by the state of the
RFB_DATA_VALID signal from the Cbox. (Need a reference
to the Mbox test packet signal description.) These
bits are used for diagnostic and test purposes only.
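
The four DBG_TEST_SEL encodings described above amount to a two-bit lookup. The following sketch shows that mapping; the function name `dbg_test_sel_source` and the returned strings are hypothetical, chosen only to mirror the wording of the description.

```c
/* Map the two-bit DBG_TEST_SEL value to the test port source it selects.
 * Only the low two bits of the argument are considered. */
static const char *dbg_test_sel_source(unsigned sel)
{
    switch (sel & 0x3u) {
    case 0x0: return "Cbox data";
    case 0x1: return "Mbox DCI debug packet";
    case 0x2: return "Mbox MAF_OUT debug packet";
    default:  return "dynamic, controlled by RFB_DATA_VALID";
    }
}
```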


RW,0    Ebox Big Endian mode enable. This bit is sent to the Ebox to
enable Big Endian support for the EXTxx, MSKxx and INSxx
byte instructions. This bit causes the shift amount to be inverted
(ones-complemented) prior to the shifter operation.


Mbox debug packet select. See DBG_TEST_SEL<0>. 


DBG_TEST_SEL<0> 
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3.9.2.15 DC MODE, Dcache Mode Register 


The DC_MODE register is a read/write register that controls diagnostic and test modes in the
Dcache. This register is cleared on chip reset but not on timeout reset.


Figure 3-47: DC_MODE, Dcache Mode Register 


Table 3-23: 


Unless the Deache has been dis- 





Name             Extent  Description

DC_ENA                   Dcache enable. ... D-stream references will be forced to miss
                         in the Dcache, ... outstanding fills will be blocked from
                         filling the Dcache.

DC_FHIT          1       Dcache force hit. When set, this bit forces all D-stream
                         references to hit in the Dcache.

DC_BAD_PARITY            When set, this bit inverts the data parity inputs to the Dcache
                         on integer stores. This will have the effect of putting bad data
                         parity into the Dcache on integer stores that hit in the Dcache.
                         This bit will have no effect on the tag parity written to the
                         Dcache or the data parity written to the CBOX Write Data
                         Buffer on integer stores. Note: Floating point stores should
                         NOT be issued when this bit is set because it may result in
                         bad parity being written to the CBOX Write Data Buffer.

DC_PERR_DISABLE          When set, this bit disables Dcache parity error reporting.
                         When clear, this bit enables all Dcache tag and data parity
                         errors. Parity error reporting is enabled during all other
                         Dcache test modes unless this bit is explicitly set.


3-36 PALcode and IPRs DIGITAL CONFIDENTIAL 


DEC Chip 21164-AA (EV5 CPU) Specification, Revision x, October 1993 



















Table 3-23 (Cont.): DC MODE Field Descriptions 


Extent Type 





Description 


DC_DOA 4 RO Hardware Deache Di 
has been disabled unde} 
fuse resides in the MB 


(a programmable/readable 
command will be set 


Deashe. When 
under software 
register must be 


... will only be supported in the following configuration:


DC_ENA = 1 

DC_FHIT = 0 
DC_BAD_PARITY = 0 
DC_PERR_DISABLE = 0 
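
The supported configuration listed above can be expressed as a single mask comparison. This is a minimal sketch under stated assumptions: the table gives bit 1 for DC_FHIT, while the positions used for DC_ENA, DC_BAD_PARITY and DC_PERR_DISABLE are assumed for illustration, and `dc_mode_is_supported_config` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

#define DC_MODE_ENA          (1u << 0) /* assumed position */
#define DC_MODE_FHIT         (1u << 1) /* extent 1 per the table */
#define DC_MODE_BAD_PARITY   (1u << 2) /* assumed position */
#define DC_MODE_PERR_DISABLE (1u << 3) /* assumed position */

/* True when DC_ENA = 1 and the three diagnostic bits are all 0.
 * Other bits (for example the read-only DC_DOA) are ignored. */
static bool dc_mode_is_supported_config(uint64_t dc_mode)
{
    const uint64_t mask = DC_MODE_ENA | DC_MODE_FHIT |
                          DC_MODE_BAD_PARITY | DC_MODE_PERR_DISABLE;
    return (dc_mode & mask) == DC_MODE_ENA;
}
```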


3.9.2.16 MAF_MODE, MAF Mode Register 


The MAF_MODE register is a read/write register that controls diagnostic and test modes in the
Mbox Miss Address File. This register is cleared on chip reset. Bit<5> is also cleared on timeout
reset.


Figure 3-48: MAF_MODE, MAF Mode Register

Bit  Field

0    DREAD_NOMERGE
1    WB_FLUSH_ALWAYS
2    WB_NOMERGE
3    MAF_NO_BYPASS
4    WB_CNT_DISABLE
5    MAF_ARB_DISABLE
6    DREAD_PENDING (READ ONLY)
     WB_PENDING (READ ONLY)
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Table 3-24: MAF_MODE Field Descriptions 
Description 
DREAD_NOMERGE 0 RW,0 Miss Address File D 





: ; When set, this bit 


dress file. Any load that is*i8su # DREAD _NOMERGE 
is set will be forced to allocate a# ntry. Subsequent merg- 
ewed (even if DREAD_NOMERGE 






























WB_FLUSH_ALWAYS 1 RW.0 ’ write buffer to flush whenever 


disal ‘all merging in the write buffer. 


WB_NOMERGE 2 RW,0 
sued, wien WB_NOMERGE is set will be 


MAF _NO_BYPASS 3 RW,0 les Dread] bypass requests in the MAF 
juests will be loaded into the MAF pend- 


ing queue ered arbitration takes place. 
WB_CNT_DISABLE   4  RW,0   ... this bit disables the 64-cycle WB counter in the
                           ... top entry of the WB will arb at low priority
                           ... LDx_L is issued or a second WB entry is made.

MAF_ARB_DISABLE  5         ... this bit disables all Dread and WB requests in the
                           ... WB_Reissue, Replay, Iref and MB requests are
                           ... from arbitrating for the Scache. This bit is cleared
                           ... timeout and chip reset.

DREAD_PENDING    6         This bit indicates the status of the MAF Dread file. When set,
                           there are one or more outstanding Dread requests in the MAF
                           file. When clear, there are no outstanding Dread requests.

WB_PENDING                 This bit indicates the status of the MAF WB file. When set,
                           there are one or more outstanding WB requests in the MAF
                           file. When clear, there are no outstanding WB requests.
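
As an illustration of the field layout in Table 3-24, the sketch below tests the two read-only status bits and the diagnostic bits. Bits 0 through 6 follow the extents given in the table; the position of WB_PENDING (bit 7) is an assumption, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAF_DREAD_NOMERGE   (1u << 0)
#define MAF_WB_FLUSH_ALWAYS (1u << 1)
#define MAF_WB_NOMERGE      (1u << 2)
#define MAF_NO_BYPASS       (1u << 3)
#define MAF_WB_CNT_DISABLE  (1u << 4)
#define MAF_ARB_DISABLE     (1u << 5)
#define MAF_DREAD_PENDING   (1u << 6)
#define MAF_WB_PENDING      (1u << 7) /* assumed position */
#define MAF_DIAG_MASK       0x3Fu     /* bits <5:0>: diagnostics and test */

/* True when neither Dread nor WB requests are outstanding in the MAF. */
static bool maf_is_idle(uint64_t maf_mode)
{
    return (maf_mode & (MAF_DREAD_PENDING | MAF_WB_PENDING)) == 0;
}

/* True when none of the diagnostic/test bits <5:0> is set. */
static bool maf_diag_bits_clear(uint64_t maf_mode)
{
    return (maf_mode & MAF_DIAG_MASK) == 0;
}
```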





Bits <5:0> of the MAF_MODE register are only used for diagnostics and test. For


... supported in the following configuration:























3.9.2.17 DC_FLUSH, Dcache Flush Register 


3.9.2.18 ALT_MODE, Alternate mode 


ALT_MODE is a write-only IPR. The AM field specifies the alternate mode used by
HW_LD and HW_ST instructions.


Figure 3-49: ALT_MODE 


Table 3-25: ALT Mode

ALT_MODE<4:3>    Mode

00               Kernel
01               Executive
10               Supervisor
11               User
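
The encoding in Table 3-25 can be sketched as a pair of helpers that place a mode number in ALT_MODE<4:3> and translate it back to a name. The function names `alt_mode_encode` and `alt_mode_name` are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Build a value for ALT_MODE with the two-bit AM field in bits <4:3>. */
static uint64_t alt_mode_encode(unsigned am)
{
    return (uint64_t)(am & 0x3u) << 3;
}

/* Return the mode name selected by ALT_MODE<4:3>. */
static const char *alt_mode_name(uint64_t alt_mode)
{
    static const char *const names[4] = {
        "Kernel", "Executive", "Supervisor", "User"
    };
    return names[(alt_mode >> 3) & 0x3u];
}
```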





3.9.2.19 CC, Cycle Counter: 


DECchip 21164-AA supports a cycle counter as described in the Alpha SRM. The low half of the
counter, when enabled, increments each CPU cycle. The upper half of the CC register is the
counter offset. CC<63:32> is written on a HW_MTPR to the CC IPR; bits <31:0> are unchanged.
CC_CTL<32> is used to enable or disable the cycle counter. The lower half of the cycle counter
is written on a HW_MTPR to the CC_CTL IPR.


The CC register is read by the RPCC instruction as defined in the Alpha SRM (The RPCC
instruction returns ...). The cycle counter is enabled to increment only 3 cycles after
the MTPR CC_CTL (with CC_CTL<32> set) is issued. This means that an RPCC instruction
issued 4 cycles after ...
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Figure 3-50: CC, Cycle Counter Register 


3.9.2.20 CC CTL, Cycle Counter Control 


The CC_CTL register is a write-only register that is used to write the low 32 bits of the cycle
counter and to enable or disable the counter. Bits CC<31:4> are written with the value
CC_CTL<31:4> on a HW_MTPR to the CC_CTL register. Bits CC<3:0> are written with zero; bits
CC<63:32> are not changed. If CC_CTL<32> is set, the counter is enabled, otherwise the
counter is disabled.


Figure 3-51: CC_CTL, Cycle Counter Control 





[Register layout: CC_ENA is bit <32>; Count<31:4> occupies bits <31:4>.]


Table 3-26: 





Count<31:4>    Cycle count. This value is loaded into bits <31:4> of the CC
               register.

CC_ENA         Cycle Counter enable. When set, this bit enables the CC register
               to begin incrementing 3 cycles later. An RPCC issued 4
               cycles after CC_CTL<32> is written will see the initial count
               incremented by 1.
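
The write behaviour of the CC and CC_CTL IPRs described in this section can be modelled in a few lines. This is a software sketch only: the `cycle_counter` struct and the two helper functions are hypothetical, and the 3-cycle delay before the counter starts incrementing is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t cc;      /* upper half: offset, lower half: counter */
    bool     enabled; /* state of CC_CTL<32> */
} cycle_counter;

/* HW_MTPR to CC: only CC<63:32> is written, bits <31:0> are unchanged. */
static void cc_write(cycle_counter *c, uint64_t value)
{
    c->cc = (value & 0xFFFFFFFF00000000ull) | (c->cc & 0x00000000FFFFFFFFull);
}

/* HW_MTPR to CC_CTL: CC<31:4> is loaded from CC_CTL<31:4>, CC<3:0> is
 * written with zero, CC<63:32> is unchanged, and bit <32> sets the enable. */
static void cc_ctl_write(cycle_counter *c, uint64_t value)
{
    c->cc = (c->cc & 0xFFFFFFFF00000000ull) | (value & 0x00000000FFFFFFF0ull);
    c->enabled = ((value >> 32) & 1u) != 0;
}
```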


3.9.2.21 DC. 
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Figure 3-52: DC_TEST_CTL, Dcache Test TAG Control Register 


pete > IGN/RAZ 
ee ieee at > INDEX<12:3> 


Table 3-27: DC_TEST_CTL Field Descriptions 













Aen set, reads from DC_TEST_TAG 
Deache bankO and writes to DC_ 
o Deache bank0. When clear, reads 


BANK1 1 


INDEX : TCE Cah ex. This field is used on reads/writes from/to 





The DC_TEST_TAG register ... used exclusively for test and diagnostics.
... the DC_TEST_CTL register is used to index into the ...


... of the Dcache and loaded into the DC_TEST_TAG_TEMP IPR register. A zero value is returned
to the integer register ... If BANK0 is set, the read is from Dcache bank0. Otherwise it is from
Dcache bank1.

When DC_TEST_TAG is written, the value written to DC_TEST_TAG is written to the Dcache
index referenced by the DC_TEST_CTL register. The tag, tag parity, and valid bits
are affected by this write. Data parity bits are not affected by this write (use
DC_MODE<DC_BAD_PARITY> ...). If BANK0 is set, the write is to Dcache bank0. If BANK1 is
set, the write is to Dcache bank1. If both are set, the write will occur to both banks.
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Figure 3-53: DC_TEST_TAG, Deache Test TAG Register 


[Register layout: TAG_PARITY <2>, OW0_VALID <11>, OW1_VALID <12>, TAG<38:13>; remaining bits IGN.]


Name         Extent  Type  Description

TAG_PARITY   2       WO    ... refers to the Dcache tag parity bit which
                           ... through 13 (valid bits not covered).

OW0_VALID    11      WO    This bit refers to the Dcache valid bit
                           for the low order octaword within a Dcache 32B block.

OW1_VALID    12      WO    Octaword valid bit 1. This bit refers to the Dcache valid bit
                           for the high order octaword within a Dcache 32B block.

TAG


- undefined value is re 
TEMP register will ré 
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Figure 3-54: DC_TEST_TAG_TEMP, Dcache Test TAG Temp Regi 


[Register layout, fields shown: TAG_PARITY, DATA_PAR0<0>, DATA_PAR0<1>, DATA_PAR1<0>,
DATA_PAR1<1>, OW0_VALID, OW1_VALID, TAG<38:13>.]




















Name           Extent  Type  Description

TAG_PARITY     2       RO    ... bit refers to the Dcache tag parity bit which
                             ... through 13 (valid bits not covered).

DATA_PAR0<0>   3

DATA_PAR0<1>   4             ... covers the upper longword of data indexed by
                             DC_TEST_CTL<INDEX>.

DATA_PAR1<0>   5             Data Parity. This bit refers to the Bank1 Dcache data parity
                             bit, which covers the lower longword of data indexed by
                             DC_TEST_CTL<INDEX>.

DATA_PAR1<1>                 Data Parity. This bit refers to the Bank1 Dcache data parity
                             bit which covers the upper longword of data indexed by
                             DC_TEST_CTL<INDEX>.

OW0_VALID                    Octaword valid bit 0. This bit refers to the Dcache valid bit
                             for the low order octaword within a Dcache 32B block.

OW1_VALID                    Octaword valid bit 1. This bit refers to the Dcache valid bit
                             for the high order octaword within a Dcache 32B block.

TAG                          Tag<38:13>. This refers to the tag field in the Dcache array.
                             (Note: Bit 39 is not stored in the array)
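
A diagnostic routine reading DC_TEST_TAG_TEMP would typically split the value into its fields. The sketch below assumes the layout suggested by the tables: TAG_PARITY in bit 2, the data parity bits in bits 3 through 5 with DATA_PAR1<1> assumed to be bit 6, the octaword valid bits assumed to sit in bits 11 and 12 as they do in DC_TEST_TAG, and the tag in bits <38:13>. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     tag_parity;   /* bit 2 */
    bool     data_par0_lo; /* DATA_PAR0<0>, bit 3 */
    bool     data_par0_hi; /* DATA_PAR0<1>, bit 4 */
    bool     data_par1_lo; /* DATA_PAR1<0>, bit 5 */
    bool     data_par1_hi; /* DATA_PAR1<1>, bit 6 (assumed) */
    bool     ow0_valid;    /* bit 11 (assumed, as in DC_TEST_TAG) */
    bool     ow1_valid;    /* bit 12 (assumed, as in DC_TEST_TAG) */
    uint32_t tag;          /* TAG<38:13>, 26 bits */
} dc_test_tag_temp_t;

static dc_test_tag_temp_t dc_test_tag_temp_decode(uint64_t raw)
{
    dc_test_tag_temp_t t;
    t.tag_parity   = ((raw >> 2)  & 1u) != 0;
    t.data_par0_lo = ((raw >> 3)  & 1u) != 0;
    t.data_par0_hi = ((raw >> 4)  & 1u) != 0;
    t.data_par1_lo = ((raw >> 5)  & 1u) != 0;
    t.data_par1_hi = ((raw >> 6)  & 1u) != 0;
    t.ow0_valid    = ((raw >> 11) & 1u) != 0;
    t.ow1_valid    = ((raw >> 12) & 1u) != 0;
    t.tag          = (uint32_t)((raw >> 13) & 0x3FFFFFFull);
    return t;
}
```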
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3.9.3 Cbhox IPRs 


DECchip 21164-AA specific IPRs for controlling Scache, Bcache ...
logging error information are listed below. These IPRs cannot be ... from the system.
These IPRs have been placed in the 1MB region of DECchip 21164-AA specific I/O address space
ranging from FFFFF00000 to FFFFFFFFFF. Any read or write ... IPR in this address
space will produce UNDEFINED behavior. The operating system ...
in this region as writeable in any mode.





Table 3-30: CBOX_IPRS Descriptions

Register      Address           Description

SC_CTL        FF FFF0 00A8      ... Scache behavior.

SC_STAT       FF FFF0 00E8      ... Scache related errors.

SC_ADDR       FF FFF0 0188      Contains the address for Scache related errors.

BC_CONTROL    FF FFF0 0128 (W)  Controls Bcache/System Interface and Bcache
                                testing.

BC_CONFIG     FF FFF0 01C8      Contains Bcache configuration parameters.

BC_TAG_ADDR   FF FFF0 0108      Contains tag and control bits for fills from
                                Bcache.

EI_STAT       FF FFF0 01...     Logs Bcache/system related errors.

EI_ADDR       FF FFF0 0148      Contains the address for Bcache/system related
                                errors.

FILL_SYN      FF FF...          Contains fill syndrome or parity bits for
                                fills from Bcache/memory.

LD_LOCK                         Contains the address for LDx_L commands.
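
The address map above can be captured as constants together with a check for the 1MB Cbox IPR region. Only the addresses that are fully legible in Table 3-30 are included; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CBOX_IPR_FIRST   0xFFFFF00000ull /* start of the 1MB region */
#define CBOX_IPR_LAST    0xFFFFFFFFFFull /* end of the 1MB region */

#define SC_CTL_ADDR      0xFFFFF000A8ull
#define SC_STAT_ADDR     0xFFFFF000E8ull
#define SC_ADDR_ADDR     0xFFFFF00188ull
#define BC_CONTROL_ADDR  0xFFFFF00128ull
#define BC_CONFIG_ADDR   0xFFFFF001C8ull
#define BC_TAG_ADDR_ADDR 0xFFFFF00108ull
#define EI_ADDR_ADDR     0xFFFFF00148ull

/* True when a physical address falls inside the Cbox IPR region. */
static bool in_cbox_ipr_region(uint64_t pa)
{
    return pa >= CBOX_IPR_FIRST && pa <= CBOX_IPR_LAST;
}

/* Name of the IPR at a given address, or NULL when the address is not one
 * of the registers whose address is legible in the table. */
static const char *cbox_ipr_name(uint64_t pa)
{
    switch (pa) {
    case SC_CTL_ADDR:      return "SC_CTL";
    case SC_STAT_ADDR:     return "SC_STAT";
    case SC_ADDR_ADDR:     return "SC_ADDR";
    case BC_CONTROL_ADDR:  return "BC_CONTROL";
    case BC_CONFIG_ADDR:   return "BC_CONFIG";
    case BC_TAG_ADDR_ADDR: return "BC_TAG_ADDR";
    case EI_ADDR_ADDR:     return "EI_ADDR";
    default:               return NULL;
    }
}
```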


3.9.3.1 Scache Contr 
SC_CTL is a rea 





SC_FHIT 

SC_FLUSH 

SC_TAG STAT<5:0> 
SC_FB_DP<3:0> 
SC_BLK_SIZE 
SC_SET_EN<2:0> 
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Table 3-31: SC_CTL Field Descriptions 









Description 


SC_FHIT 0 (RW,0) When set, this bit ca: 
ST’s to hit in the 
bits. Non-cacheabl 

the Scache and will } 

one Scache set may be enablg; 
parity checki disab 

For STx, v: j 

SC_TAG_STAE 

Scache tag will 

tags will 


rce catheable Ld’s and 
e of the tag status 
t be forced to hit in 
. In this mode, only 
Scache tag and data 












“sed to write the Scache tag. 
ax address received by the Chox. 
tten with the STx address. SC_ 
‘on reset. 





SC_FLUSH 1 (RW,0) “valid bits in the Scache will be 
cl ‘CTL ipr is written. SC_FLUSH bit 
wi 

SC_TAG_ STAT 7:2 (RW) is bi only used in the SC_FHIT mode to 


SC_FB_DP      11:08   This field can be used to write bad data parity for the
                      selected LW's within the OW when writing the Scache. If
                      ... of these bits is set to one, then the corresponding
                      ... computed parity value will be inverted when writing
                      ... Scache writes, the Cbox allocates two consecutive cycles
                      to write up to two OW's based on the LW valid bits
                      received from the Mbox. Therefore, the same LW parity
                      control bits will be used for writing both OWs. For example,
                      Bit 8 corresponds to LW0 and LW4. This bit field will
                      be cleared on reset.

SC_BLK_SIZE           This bit can be used to select the Scache and Bcache block
                      size to be either 64 byte or 32 byte. The Scache and Bcache
                      will always have identical block size. All the Bcache and
                      memory fills or write transactions will be of the selected
                      block size. At the power up time this bit will be set and the
                      default block size will be 64 byte. When clear, the block
                      size will be 32 byte. This bit must be set to the desired
                      value to reflect the correct Scache/Bcache block size before
                      DECchip 21164-AA does the first cacheable read or write
                      from Bcache or system.
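
Two of the SC_CTL fields above lend themselves to a short sketch: the block size selection and the mapping from a longword to its SC_FB_DP control bit. The position of SC_BLK_SIZE (bit 12) is an assumption based on it being the only bit between SC_FB_DP <11:08> and SC_SET_EN <15:13>, and the LW mapping extrapolates the single example given in the text (bit 8 covers LW0 and LW4). The helper names are hypothetical.

```c
#include <stdint.h>

#define SC_CTL_FB_DP_SHIFT 8u
#define SC_CTL_BLK_SIZE    (1u << 12) /* assumed position */

/* Block size in bytes selected by SC_BLK_SIZE: set = 64, clear = 32. */
static unsigned sc_block_size_bytes(uint64_t sc_ctl)
{
    return (sc_ctl & SC_CTL_BLK_SIZE) ? 64u : 32u;
}

/* SC_CTL bit that controls parity inversion for longword lw (0..7).
 * The same four control bits are reused for both octawords. */
static unsigned sc_fb_dp_bit_for_lw(unsigned lw)
{
    return SC_CTL_FB_DP_SHIFT + (lw & 0x3u);
}
```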
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Table 3-31 (Cont.): SC CTL Field Descriptions 


Description 










SC_SET_EN 15:13 (RW,1) This field will be use# 
or all three sets may 
combination of two & 
behavior. 











g away their fuses. "Fuse 
t fl ‘ either all sets or one set. 
Any write to ertabh #iainetitly disabled set will have no 
effect. Power-up eon 








Table 3-32: SC _TAG_STAT Field Description 
Scache Tag Status<7:2> Description 


SC_TAG_STAT<7:4> Tag Pari 
SC_TAG_STAT<3:2> 





irty; bits 7, 6, 5, 4 respectively 


leared or unlocked by reset. Any PAL code read 
and clears SC_STAT. 


+---> SC_TPERR<2:0> 
ee > SC_DPERR<7:0> 
Hane aee Hoe Ses eee ree > CBOX_CMD<4:0> 
ee aa ee > SC_SCND_ERR 
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Table 3-33: SC_STAT Field Descriptions 


















Description 


SC_TPERR 2:0 (RO) These bits, when set: 





SC_DPERR 10:3 (RO) 
if any LW wiki d from the Scache during 
lookup had a ¢ Bit 3 corresponds to LWO 
as shown i in the 

CBOX_CMD 15:11 (RO) 


SC_SCND_ERR 16 : 
| a parity error while the SC_TPERR or SC_ 
was already set from the earlier transaction. 
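
The SC_STAT extents listed in Table 3-33 (SC_TPERR 2:0, SC_DPERR 10:3, CBOX_CMD 15:11, SC_SCND_ERR 16) can be unpacked as follows. This is an illustrative sketch; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    unsigned tperr;    /* SC_TPERR<2:0> */
    unsigned dperr;    /* SC_DPERR<7:0>, register bits 10:3 */
    unsigned cbox_cmd; /* CBOX_CMD<4:0>, register bits 15:11 */
    bool     scnd_err; /* SC_SCND_ERR, bit 16 */
} sc_stat_t;

static sc_stat_t sc_stat_decode(uint64_t raw)
{
    sc_stat_t s;
    s.tperr    = (unsigned)(raw & 0x7u);
    s.dperr    = (unsigned)((raw >> 3) & 0xFFu);
    s.cbox_cmd = (unsigned)((raw >> 11) & 0x1Fu);
    s.scnd_err = ((raw >> 16) & 1u) != 0;
    return s;
}
```

Since a read of SC_STAT unlocks the register, a handler would capture the raw value once and decode the saved copy.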





Table 3-34: SC_CMD Fiel 
SC CMD Source<15:14> é 








oding<13:11> 






Description 
Set Shared from System 
Read Dirty from System 
Invalidate from System 


Scache Victim 















Scache I-read 
Scache D-read 
Scache D-write 


... set. If an Scache tag or data parity error is detected, then this register gets locked,
which prevents further updates. This register is unlocked whenever SC_STAT is read.
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For Scache Reads the address bits <39:4> are valid to identify the addres ing riven to the 
Scache. Address bit <4> identifies which OW was accessed first. For eat 
one tag access and two data access cycles. If there is a hit, two OWs: j 
CPU cycles. Tag parity error is detected only while reading the:frst OW, er, data parity 
error can be detected on either of the two OWs. : 


If SC_CTL<SC_FHIT> is set, SC_ADDR is used for storing th sand s 
tag in the Scache, there are unique valid, shared, dirty and mddify b 
basis. Tag and parity bits are common for both sub-blocks,, In Force AE 
probes will load the SC_ADDR register. The Scache wi : 

which is enabled. In this mode, tag and data parity chee 
SC_STAT iprs are not locked on a error. 


s bits. For each 
a sub-block (32B) 
“miode, ONLY reads and 
and status from the set 


SC_ADDR or SC_STAT. 


Normal Mode: 
63 40 39 38 


| RAO | Oo] SC_ADDR 
pone re renee een nee penton 5-5 


Force Hit Mode: 
63 39 38 


Se SC TAG PARITY 
Soa tee eee TAG STATUS FOR SUB-BLOCKO 
TAG STATUS FOR SUB-BLOCR1 
OW’s MODIFIED FOR SUB-BLOCKO 
OW’ s MODIFIED FOR SUB-BLOCK1 


= sc TAG 
RAO --> Read As One ‘ 
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63 27 26 25 24 191817 16 15 14 13 12 08 07 06 05 04 03 0 
[BC_CONTROL register layout. Fields shown: EI_CMD_GRP1, EI_CMD_GRP2, CORR_FILL_DAT,
VTM_FIRST, EI_ECC_OR_PARITY, BC_FHIT, BC_TAG_STAT<4:0>, BC_BAD_DAT, EI_DIS_ERR,
TL_PIPE_LATCH, BC_WAVE<1:0>, PM_MUX_SEL<5:0>, DBG_MUX_SEL, DIS_BAF_BYP, DIS_SC_VIC_BUF.]


Field 
BC_ENABLED 0 


ALLOC_CYC      1    ... allocate cycle for non-cacheable LDs. When set, the issue
                    ... will not allocate cycle for non-cacheable LDs.

                    This bit must be clear before reading any Cbox IPR. It can
                    be clear when reading all other IPRs and non-cacheable
                    LDs. This bit will be clear on reset.

EI_CMD_GRP2    2    When set, the optional commands, LOCK and SET DIRTY
                    will be driven to the DECchip 21164-AA external interface
                    command pins to be acknowledged by the system interface.
                    When clear, it is unpredictable if these commands
                    will be driven to the command pins. However, system
                    should never CACK these commands if this bit is clear.

EI_CMD_GRP3         When set, the MB command will be driven to the DECchip
                    21164-AA external interface command pins to be acknowledged
                    by the system interface. When clear, it is unpredictable
                    if MB command will be driven to the command
                    pins. However, system should never CACK the command
                    if this bit is clear.
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Table 3-35 (Cont.): BC_CONTROL Field Descriptions 


Description 












CORR_FILL_DAT 4 (WO,1) mory, in ECC mode. 


ache or memory will 
‘before being driven 


detected. If thé ig correctable, corrected data will be 
returned again, i Will be invalidated, and error trap 


VTM_FIRST 5 (WO,1) 


dress and command. Cleared for sys- 
‘buffer. If clear, ona Beache miss with 


EI]_ECC_OR_PARITY 6 (Wi 







rmines whether to operate the external inter- 
‘ -W ECC or Byte parity mode. When set, DECchip 
21164 AK Pe orca QW ECC on the data check 


BC_FHIT        7      Bcache force hit. When this bit is set and the Bcache is
                      enabled, all references in cached space are forced to hit in
                      the Bcache. Fill to the Scache will be forced to be private.
                      Software should turn off BC_CONTROL<2> to allow clean
                      to private transitions without going to the System.

                      For STx, value of status, parity and tag bits specified by
                      BC_TAG_STAT field will be used to write the Bcache tags.
                      Bcache tag and index will be the STx address received by
                      the BIU. It will write the Bcache tag RAMs with the STx
                      address minus the Bcache index. BC_FHIT bit will be
                      cleared on reset.

BC_TAG_STAT    12:8   This bit field can be only used in BC_FHIT mode to
                      write any combination of tag status and parity bits in the
                      Bcache. The parity bit can be used to write bad tag parity.
                      These bits will be undefined on reset. See Table 3-36 for
                      the encodings.
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Table 3-35 (Cont.): BC_CONTROL Field Descriptions 
Field Type 


BC_BAD_ DAT 14:13 (WO,0) When set, this fiel san Le write bad data with 
correctable or uncor C #CC mode. When bit 


Extent Description 





bit <14> is set, data bi#edS 
the same OW is read from 


are inverted. When 

Beache, DECchip 21164- 

ectable ECC error on both 

ef bits <14:13> used when 
ed on reset. 


the DECchip 21164-AA to ignore 
m a fill data received from the 











EI_DIS_ERR 15 (WO,1) 


TL_PIPE_LATCH 16 (WO,0) causes DECchip 21164-AA to pipe the 


BC_WAVE    18:17    ... will determine the number of cycles of wave
                    ... that should be used during private reads of the ...

                    ... wave pipelining, the BC_RD_SPD should be set
                    ... the latency of the Bcache read. BC_CONTROL<18:17>
                    should be set to the number of cycles to subtract from
                    BC_RD_SPD to get the Bcache repetition rate. For example,
                    if BC_CONFIG<BC_RD_SPD> is set to 7 and BC_CONTROL<18:17>
                    is set to 2, it will take 7 cycles for valid data to arrive
                    at the pins, but a new read will start every 5 cycles.

                    The read repetition rate must be greater than 3. For example,
                    it is not permitted to set BC_CONFIG<BC_RD_SPD> to 5 and
                    BC_CONTROL<18:17> to 2.
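
The wave pipelining arithmetic above can be checked with a small helper. This sketch simply restates the rule in the text (repetition rate = BC_RD_SPD minus BC_WAVE, and the result must be greater than 3); the function names are hypothetical.

```c
#include <stdbool.h>

/* Cycles between the starts of successive Bcache reads. */
static unsigned bc_read_repetition(unsigned bc_rd_spd, unsigned bc_wave)
{
    return bc_rd_spd - bc_wave;
}

/* True when the BC_RD_SPD / BC_WAVE pair respects the stated limit.
 * bc_wave is a two-bit field, so values above 3 are rejected. */
static bool bc_wave_setting_ok(unsigned bc_rd_spd, unsigned bc_wave)
{
    return bc_wave <= 3u && bc_rd_spd > bc_wave &&
           bc_read_repetition(bc_rd_spd, bc_wave) > 3u;
}
```

With the values from the text, a read speed of 7 and a wave setting of 2 give a repetition rate of 5 and are accepted, while a read speed of 5 with a wave setting of 2 is rejected.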


PM_MUX_SEL     This bit field is used for selecting the BIU parameters
               to be driven to the two performance monitoring counters
               in the Ibox. See Table 3-37 for the encodings. See the
               Performance Monitoring chapter for the detailed functionality.
               On power-up, this field will be initialized to a value
               of 0.

DBG_MUX_SEL    This bit field is used for selecting the first group of 8 Cbox
               signals driven to the Mbox for debug purpose. See XX
               chapter for the details of these signals. On power-up, this
               field will be initialized to a value of 0.
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Table 3~35 (Cont.): 






DIS_BAF BYP 26 (WO,0) 


DIS_SC_VIC_BUF 27 (WO,0) 


Beache Tag Status<12:8> 


BC_TAG_STAT<12> Parity 
BC_TAG_STAT<11> ; 
BC_TAG_STAT<10> B 


BC_TAG_STAT<9> 
BC_TAG_STAT<8> 


Table 3-37: PM_MUX | 
PM_MUX_SEL<21:19> 








Ox1 


0x3 
Ox4 
0x5 cache References 
Bcache Victims 


System Requests 


BC_CONTROL Field Descriptions 


Description 


When set, speculatiyi 
Scache READs are 
READ-MISS or 
Index pins will changé"4 r 
When clear, if reads are hiti 


there is no 
the Beache 


Oxi 
Ox2 
0x3 


Ox4 
0x5 


0x6 
Ox7 


PM_MUX_SEL<24:22> 


















are disabled while 
here is no pending 
Thus, the Beache 
cache READ speed. 
the Scache and and 
or WRITE in the BIU, 
ge every other cpu cycle. 
nitialized to a value of 0. 


ache victim buffer will disabled. 
ating a dirty block will write 


Counter 2 


Scache Misses 
Scache Read Misses 
Scache Write Misses 
Scache Shared Writ: 


Scache Writes 


Beache Misses 


System Invalidates 
System Read Reque 
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3.9.3.5 Bcache Configuration Register, BC_CONFIG 


The Bcache configuration register is write only.


Table 3-38: BC_CONFIG Field Descriptions 


Description 


BC_SIZE 2:0 (WO,1) This field is used to indicate ...
power-up, this field will be initialized to a value of 1MB
Bcache. See ... encodings.





RESERVED 3 (WO,0) 
BC_RD_SPD 7:4 (WO,4)


... (bc_rd_spd - bc_wrt_spd) < 4


BC_WR_SPD 11:8   ... used to indicate to the BIU the write time of
                 ... measured in CPU cycles. The Bcache write
                 ... must be within four to ten CPU cycles. On power-up, ...


                 ... equal to SYS clock to CPU clock ratio.


BC_RD_WR_SPC   This field is used to indicate to the BIU the number of
               CPU cycles to wait when switching from a private read to
               a private write Bcache transaction. For other data movement
               commands, such as Read Dirty or Fill from memory,
               it is up to the system to direct system wide data movement
               in a way that is safe. ONE must be the minimum value
               for this field.

               BIU will always insert 2 CPU cycles between private
               Bcache reads and private Bcache writes in addition to the
               number of CPU cycles specified by this field. The maximum
               value should not be greater than the Bcache READ
               speed when Bcache is enabled.

               On power-up, this field will be initialized to a read/write
               spacing of seven CPU cycles.

RESERVED       Must Be Zero.
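
The read-to-write spacing rules for BC_RD_WR_SPC can be sketched as two helpers: one computing the total gap (the field value plus the 2 CPU cycles the BIU always inserts) and one checking the stated limits (minimum of one, maximum no greater than the Bcache read speed). The function names are hypothetical.

```c
#include <stdbool.h>

/* Total CPU cycles between a private Bcache read and a private write:
 * the programmed spacing plus the 2 cycles the BIU always inserts. */
static unsigned bc_rd_wr_gap_cycles(unsigned bc_rd_wr_spc)
{
    return bc_rd_wr_spc + 2u;
}

/* True when the field value is at least one and does not exceed the
 * Bcache read speed (the limit that applies when the Bcache is enabled). */
static bool bc_rd_wr_spc_ok(unsigned bc_rd_wr_spc, unsigned bc_rd_spd)
{
    return bc_rd_wr_spc >= 1u && bc_rd_wr_spc <= bc_rd_spd;
}
```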
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Table 3-38 (Cont.): BC_CONFIG Field Descriptions 


Description 


















the Sysclock edge, 
es not affect private 
ls from the system, 
the number of CPU 
write pulse value as 
field. 

4 value in the range of one 
to seven CPU ; never exceed the sysclock 
ck ratio is 3, this field must not 
wer-up, this field is initialized to 
ie CPU cycle. 


FILL_WE_OFFSET 18:16 (WO,1) Beache write enables 


when writing the Beack 
cycles to wait before driving, 


be larg ht 3.) 
a writes 
RESERVED 19 (WO,0) 


BC_WE_CTL 28:20 (WO,0) 


the write pulsés not asserted. Each bit corresponds to 
a CPU cyclegsAt the start of a Beache write cycle, write 

: i ys be de-asserted for one CPU cycle. After 
te, bit <20> of the register is used to assert the 
. Each cycle, the next bit will be used to assert 


RESERVED 
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3.9.3.6 External Interface Status eee sae 


unlocks and clears it. Read of EI_STAT will also unlock EL 
registers subject to some restrictions listed below. This regis! 
reset. 


Fill data from Bcache or memory could have correctable (c) or ...
mode. In parity mode, fill data parity errors are treated as uncorrectable ...
address/cmd parity errors are always treated as uncorrectable ...
mode. The sequence for reading, unlocking, and clearing ...
EI_STAT are as follows:


1. Read EI_ADDR, BC_TAG, FILL_SYN: Can be ...
register.


2. Read EI_STAT register: Reading of this register will unlock EI_ADDR, BC_TAG, FILL_SYN
registers as described below. EI_STAT will be ... and cleared on read subject to
conditions listed below.


Action when EI_STAT 


Lock Reg is read 





0 0 not possk ; no clear and unlock ev- 
erything 

1 0 not pi no clear and unlock ev- 
erything 

0 1 yes clear and unlock ev- 
erything 

11 1 yes yes clear (c) bit don’t un- 
lock. Transition to 
(0,1,0) state. 

0 1 no already locked clear and unlock ev- 
erything 

1} 1 no already locked clear (c) bit don’t un- 
lock. Transition to 


(0,1,1) state. 


1 These are special cases. It is possible that when EI_ADDR was read, only correctable error bit is set and registers are not
locked. By the time ... uncorrectable error is detected and the registers get loaded again and locked. The value
of EI_ADDR read ... is no longer valid. Therefore, for the (1,1,x) case, when EI_STAT is read correctable error bit is
cleared and registers are not unlocked or cleared. Software must re-do the IPR read sequence. On the second read, error bits
are in (0,1,x) state, IPRs are unlocked, and EI_STAT is cleared.


ters are neither loaded nor locked. 
re locked on first uncorrectable error except the second hard error bit. 
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¢ The second hard error bit is set ONLY for an uncorrectable err 
error. If correctable error follows an uncorrectable, it will not be 
Note that Bcache tag parity errors are uncorrectable in this ¢ 


orrectable 
cond error. 


EI STAT Register 


63 36 35 34 33 32 31 30 29 28 27 0 
fono---------- = ----- == ------- pon tantan penta tontentento--=+ 
| RAO |RO| RO} RO|RO|RO|RO|RO| RO} RAO| 
fone n= --------------------- fon tentan teeta tonteat--t---=+ 


RAO --> Read As One 





Table 3-41: El_STAT Field Descriptions 


Name          Extent  Type  Description

BC_TPERR                    ... tag address RAM.

BC_TC_PERR    29      RO    ... when set, indicates that a Bcache read encountered ...

EI_ES         30            External interface error source. This field indicates if the error
                            ... command parity error. When set, it indicates that the
                            error source is memory or system. If not set, it is Bcache.

COR_ECC_ERR   31            Correctable ECC error. This bit, when set, indicates that a fill
                            data received from outside the CPU contained a correctable
                            ECC error.

UNC_ECC_ERR                 Uncorrectable ECC error. This bit, when set, indicates that
                            a fill data received from outside the CPU contained an
                            uncorrectable ECC error. In the parity mode it indicates data parity
                            error.

EI_PAR_ERR                  External Interface address/command parity error. This bit,
                            when set, indicates that an address and command received by
                            the CPU has a parity error.

FIL_IRD                     This bit is only meaningful when one of the ECC or parity
                            error bits is set. FIL_IRD is set to indicate that the error
                            which caused one of the error bits to get set occurred during
                            an I-ref fill and clear to indicate that the error occurred during
                            a D-ref fill.

SEO_...                     Second external interface hard error. This field indicates that
                            the fill from Bcache or memory or the system address/command
                            received by the CPU has a hard error while one of the hard
                            error bits in the EI_STAT register is already set.
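
The EI_STAT fields described above can be tested with simple masks. In this sketch the positions of BC_TC_PERR (29), EI_ES (30) and COR_ECC_ERR (31) come from the table; the remaining positions are assumptions taken from the order of the eight RO bits <35:28> in the register figure. Treating tag parity, uncorrectable ECC and address/command parity errors as the "hard" errors is also an assumption, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EI_STAT_BC_TPERR    (1ull << 28) /* assumed position */
#define EI_STAT_BC_TC_PERR  (1ull << 29)
#define EI_STAT_EI_ES       (1ull << 30)
#define EI_STAT_COR_ECC_ERR (1ull << 31)
#define EI_STAT_UNC_ECC_ERR (1ull << 32) /* assumed position */
#define EI_STAT_EI_PAR_ERR  (1ull << 33) /* assumed position */
#define EI_STAT_FIL_IRD     (1ull << 34) /* assumed position */
#define EI_STAT_SECOND_HARD (1ull << 35) /* assumed; field name is truncated */

/* True when any error assumed to be uncorrectable ("hard") is logged. */
static bool ei_stat_has_hard_error(uint64_t v)
{
    return (v & (EI_STAT_BC_TPERR | EI_STAT_BC_TC_PERR |
                 EI_STAT_UNC_ECC_ERR | EI_STAT_EI_PAR_ERR)) != 0;
}

/* True when only a correctable ECC error is logged. */
static bool ei_stat_only_correctable(uint64_t v)
{
    return (v & EI_STAT_COR_ECC_ERR) != 0 && !ei_stat_has_hard_error(v);
}

/* Error source as described for EI_ES. */
static const char *ei_stat_error_source(uint64_t v)
{
    return (v & EI_STAT_EI_ES) ? "memory or system" : "Bcache";
}
```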
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3.9.3.7 External Interface Address Register, EL ADDR 


... associated with errors reported by EI_STAT register. Its content is ...
error bits is set. Read of EI_STAT unlocks EI_ADDR register.










... the results of every Bcache tag read. When a tag ... parity error occurs this register is
locked against further updates. Software may read ... by using the DECchip 21164-AA
specific I/O space address instruction. This register ...
read. This register is not unlocked by reset.


RAO --> Read As One 


Unused tag bits in th 


... other correctable error does not re-load it. It gets loaded and locked if
... or parity error is recognized during a fill from Bcache or memory ...
... The FILL_SYN register is unlocked when the EI_STAT register is read.
... unlocked by reset.









... for a list of syndromes associated with correctable single-bit errors.
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If the chip is in parity mode and a parity error is recognized dusttt 
the FILL_SYN register indicates which of the bytes in the octawor 
SYNDROME‘(7..0] is set appropriately to indicate the bytes within, 





Data Bit Syndrome(Hex) 
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Table 3-42 (Cont.): Syndromes For Single-Bit Errors 
Data Bit Check Bit 





Syndrome(Hex) 
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Table 3-42 (Cont.): 
Data Bit 


Syndromes For Single-Bit Errors 
Syndrome(Hex) Check Bit Syndrome(Hex) 












3.9.3.10 Load Lock Register, LD_LOCK 


The Load Lock register is read only. It can be read by PALcode for diagnostic purpose. It is not
cleared by reset.


ves LDx_L command and the 
t/miss or any parity error in 


RAO -->Read As One 


3.9.4 PAL Restrictions 


3.9.4.1. Definitions 


Y if checked 


ictions(note:numbers refer to cycle number): by PVC: 





§ HW_REI or HW_REI_STALL in cycle 0 
No MFPR EXC_ADDR in cycle 0,1 

" No HW_REI or HW_REL STALL in 0, 1 
PAL must slot to EO 





No other Mbox instruction in 0 


No other virtual reference in 0 


No Mbox MTPR or MFPR in 0 Y 
No MFPR MAF_MODE in 1,2 Y 
No MFPR DC_PERR_STAT in 1,2 Y 
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Table 3-43 (Cont): 


The following in cycle 0: 


Any Store instruction 
Any Virtual Mbox instruction 


Any Mbox instruction or WMB, if 
it traps 


Any Ibox trap except pc mispred, 
itbmiss, or OPCDEC due to user 
mode 


HW_REI_STALL 


MTPR any undefined IPR num- 
ber 


ARITH trap entry 
Machine_check trap entry 


MTPR Any Ibox IPR (including 
PALtemps) 


MTPR ASTRR, ASTER, SIRR, SICR 


MTPR EXC_ADDR 
MTPR IC_FLUSH_CTL 
MTPR ICSR: HWE, FPE 
MTPR ICSR: SPE, FMS 


MTPR ICSR: SPE 
MTPRICSR: SDE * 




PAL Restrictions Table 











Yif checked 
by PVC: 


Restrictions(note:numbers refer to 






























No MFPR DC_TEST_TAG slotted in 0 


No MFPR DC_PERR_STAT in 1,2 Y 
No MTPR DTBIS in 1 Y 
MTPR any Ibox IPR not aborte 
(except that EXC_ADDR is up “iting PC) 
MTPR DTBIS not aborted in 0,1 Y 
MTPR DTBIS not aborted 
Only 1 HW_RELS of 4 instructions 
Illegal in any cycle 
in cycle 0,1 
2,3,4,5,6,7 
K in cycle 0,1 
Y 
Y 
Y 
Y 
1_STALL, then no HW_REI_STALL in 0,1 Y 
3 iL, then no HW_REI in 0,1,2,3,4 Y 
(fush Icache 
#4 Lshadow read/write in 0,1,2,3 
‘HW_REI in 0,1,2 Y 
ust be followed by HW_RELSTALL 
* No HW_REL STALL in cycle 0,1,2,3,4 Y 
Must be followed by HW_REI STALL 
Must be followed by HW_REISTALL 
HW_REL STALL must be in the same Istream octaword 
No MFPR IFAULT_VA_FORM in 0,1,2 Y 
No CALL_PAL in 0,1,2,3,4,5,6,7 Y 
No HW_REI in 0,1,2,3,4,5,6 Y 
No HW_REI in 0,1,2 Y

Table 3-43 (Cont.): PAL Restrictions Table
The following in cycle 0: Restrictions (note: numbers refer to cycle number): Y if checked by PVC:
No priv. CALL_PAL in 0,1,2,3 
MTPR CC, CC_CTL No RPCC in 0,1,2 Y 
MTPR DC_FLUSH No Mbox instructions in 1,2 Y 
No outstanding fills in 0 
MTPR DC_MODE No Mbox instructions in 1,2,3, Y 
No MFPR DC_MODE in 1,2 Y 
No outstanding fills in 0. 
MTPR DC_PERR_STAT No load or store instru, Y 
No MFPR DC_PERR Y 
MTPR DC_TEST_CTL No MFPR DC_TEST Y 
No MFPR DC_TEST_CTL dtted in 1,2 
MTPR DC_TEST_TAG No outstanding DC fills in 0 
Y 
MTPR DTB_ASN Y 
MTPR DTB_CM, ALT_MODE Y 
MTPR DTB_PTE Y 
Y 
MTPR DTB_TAG Y 
Y 
Y 
Y 
MTPR DTBIAP, DTBIA Y 
MTPR DYBIS in 0,1,2 Y 
MTPR DTBIA PR DTB_PTE in 1 Y 
MTPR MAF_MODE ox instructions in 1,2,3 Y 
WMB in 1,2,3 Y 
No MFPR MAF_MODE in 1,2 Y
No virtual Mbox instructions in 0,1,2,3,4 Y
No MFPR MCSR in 1,2 Y 
No MFPR VA_FORM in 1,2,3 Y 
No MFPR VA_FORM in 1,2 Y 


No outstanding DC fills in 0 
No MFPR DC_TEST_TAG_TEMP issued or slotted in 1 
No LDx instructions slotted in 0

Table 3-43 (Cont.): PAL Restrictions Table

The following in cycle 0: Restrictions (note: numbers refer to cycle number): Y if checked by PVC:
. No MTPR DC_TEST_CTL between Mer De 


MFPR DC_TEST_TAG_TEMP 


MFPR DTB_PTE No Mbox instructions in 0,1 Y 
No MTPR DC_TEST_CTL, DC_TEST_TAG in 0 Y
No MFPR DTB_PTE_TEMP iss 38 
No MFPR DTB_PTE in 1 Y 
Y 


No virtual Mbox instructions in 0,1,2


MFPR VA Must be done in ARITH: 
UNALIGN, DFAULT 


‘DTBMISS_SINGLE, 
after the VPTE load 








Table 3-44: Cbox IPR Restrictions Table 


Store to SC_CTL, BC_CTL, BC_ Must be preceded by MB
CONFIG except if no bit is changed Must be followed by MB
other than: 
BC_CTL<ALLOC_CYC>, 
BC_CTL<PM_MUX_SEL>, or 
BC_CTL<DBG_MUX_SEL> 


Store to BC_CTL that only changes 
bits: 

BC_CTL<ALLOC_CYC>, 
BC_CTL<PM_MUX_SEL>, or 
BC_CTL<DBG_MUX_SEL> 


Load from SC_STAT 

















“references 





Load from EI_STAT 

Any Cbox IPR address

Any undefined Cbox IPR

Scache or Bcache in fo No STx_C to cacheable space

Clearing of SC_FHIT Must be followed by MB, read of SC_STAT, then MB prior to subsequent store

Clearing of BC_FHIT Must be followed by MB, read of EI_STAT, then MB prior to subsequent store

Chapter 4


External Interface 


4.1 Chip interface 


Figure 4-1: DECchip 21164-AA System Interface

[Figure: block diagram of DECchip 21164-AA connected to the SYSTEM (memory and I/O) and the BCACHE. Legible signal labels include CMD_H<3:0>, ADDR_H<39:4>, DATA, FILL_H, FILL_ID_H, FILL_DONE_EARLY_H, and FILL_ERROR_H.]

4.1.1 Overview


The DECchip 21164-AA chip is contained in the 503 pin package. All of the extra pins, compared 
to EV4, are used for power and ground. This means that the system interface will remain a 128 
bit bi-directional data bus. The only way to improve the bandwidth of the system interface is to 
cycle it faster and to use it more often. 


The cycle time of the system interface will be some integer multiple of the DECchip 21164-AA 
cycle time. The minimum multiple is 3x. The maximum multiple is 15x. The tested points 
between the min. and max. are TBD. The DECchip 21164-AA team will focus on the testing of 
values that our SYSTEM partners plan to use. Some testing of all possible values will be done. 


DECchip 21164-AA can be used to build systems with or without a module-level Bcache. The
read and write speeds of the Bcache can be programmed independently of the Sysclock ratio and
of each other. Some care must be taken to make fills and read/read dirty transactions work. The
cache system supports a block size of 32 bytes or 64 bytes. The block size is selected by a mode bit.


Section 9.1 lists the DECchip 21164-AA signal pins. Figure 4—1 shows a simple picture of the 
system interface. 


Chapter 8 describes the AC requirements for DECchip 21164-AA. 


DECchip 21164-AA can take one command/address from the SYSTEM at a time. The Scache and/or
Bcache will be probed to determine what must be done with the command. If nothing will be
done, the command is ACKed and removed. If a Bcache read, set shared, or invalidate is required,
it will be done as soon as the Bcache is free. The command will be ACKed at the start of the
Bcache transaction.


In general, the DECchip 21164-AA BIU can hold one or two misses and one or two Scache victim
addresses. These four addresses, along with the SYSTEM request, will arbitrate for the Bcache. Data
movement for the SYSTEM has the highest priority for the Bcache. This includes fills, reading dirty
data, invalidates, and set shared. If there are no SYSTEM requests for the Bcache, a DECchip
21164-AA command will be selected.


All transactions between DECchip 21164-AA and the SYSTEM are non-pended, except for fills. 
DECchip 21164-AA may request up to two fills from memory (if the SYSTEM allows two). Any 
read or write transaction in the cache must be completed once it is started. 


Blocks in the Scache/Bcache that have data movement pending to them will not be read or written
by the CPU until the data movement is completed. The SYSTEM will not be prevented from reading
or writing blocks in the Scache/Bcache. For example, if the CPU has requested a write to a clean
block, it will not be allowed to access that block until the write completes, but the
SYSTEM will always be able to access the block.


The SYSTEM may have one or more Bcache victim buffers. Each time a Bcache victim is produced,
DECchip 21164-AA will stop reading the Bcache until the SYSTEM takes the current victim. Bcache
operations will then resume.


DECchip 21164-AA requires wrapped reads on INT16 boundaries. The valid wrap orders for 64-byte
blocks are selected by bits PA<5:4>. They are:

• 0, 1, 2, 3
• 1, 0, 3, 2
• 2, 3, 0, 1
• 3, 2, 1, 0

For 32-byte blocks the valid wrap orders are selected by PA<4>. They are:

• 0, 1
• 1, 0


WRITE BLOCK and WRITE BLOCK LOCK commands from DECchip 21164-AA will not be 
wrapped. They will always write INT16 zero, one, two, and three. BCACHE VICTIM commands 
will provide the data with the same wrap order as the read miss that produced them. 
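
The wrap orders listed above can be generated mechanically. The following is a minimal sketch, not a description of the chip's internal logic: it relies on the observation that each listed order equals the starting INT16 index XORed with 0, 1, 2, 3 (or 0, 1 for 32-byte blocks). The function name `wrap_order` and its parameters are hypothetical.

```python
from typing import List


def wrap_order(pa: int, block_size: int = 64) -> List[int]:
    """Return the INT16 transfer order for a wrapped read.

    pa         -- physical address of the requested data (assumed input)
    block_size -- cache block size in bytes, 32 or 64
    """
    if block_size == 64:
        start = (pa >> 4) & 0b11  # PA<5:4> selects one of four orders
        count = 4
    elif block_size == 32:
        start = (pa >> 4) & 0b1   # PA<4> selects one of two orders
        count = 2
    else:
        raise ValueError("block size must be 32 or 64 bytes")
    # Observation (not stated in the text): every listed order is start XOR i.
    return [start ^ i for i in range(count)]
```

For example, an address with PA<5:4> = 2 in a 64-byte block yields the order 2, 3, 0, 1, matching the third entry in the list.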


4.1.2 Physical Memory Regions 


DECchip 21164-AA physical memory is divided into three regions. The first region is the first half
of the physical address space. It is treated by DECchip 21164-AA as memory-like. The second
region is the second half of the physical address space except for a 1 MByte region reserved for
Cbox IPRs. It is treated by DECchip 21164-AA as non-cacheable. The third region is the 1 MByte
region reserved for Cbox IPRs.


In the first region, writeback caching, write merging and load merging are all permitted. All 
DECchip 21164-AA accesses in this region are 32-byte or 64-byte depending on the block size. 


DECchip 21164-AA does not cache data accessed in the second and third regions of the physical
address space. DECchip 21164-AA read accesses in these regions are always 32-byte requests.
Load merging is permitted, but the request includes a mask to tell the SYSTEM environment
which INT8s are accessed. Write accesses are 32-byte requests, with a mask indicating which
INT4s are actually modified. DECchip 21164-AA will never write more than 32 bytes at a time
in non-cached space.


DECchip 21164-AA does not emit accesses to the Cbox IPR region if they map to a Cbox IPR.
Accesses in this region that are not to a defined Cbox IPR produce UNDEFINED results.


Table 4-1: Physical Memory Regions 


Region            Address Range (hex)      Description
memory-like       0000000000-7FFFFFFFFF    Writeback cached, load and store merging allowed
non-cacheable     8000000000-FFFFEFFFFF    Not cached, load merging limited
Cbox IPR region   FFFFF00000-FFFFFFFFFF    Cbox IPRs; accesses do not appear on the pins unless
                                           an undefined location is accessed (which produces
                                           UNDEFINED results)
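
As an illustration of the three regions, the sketch below classifies a physical address. It assumes a 40-bit physical address (consistent with ADDR_H<39:4>), so that the first half of the address space ends at 7FFFFFFFFF hex; the function and constant names are hypothetical.

```python
PA_MAX = 0xFFFFFFFFFF            # top of the 40-bit physical address space (assumed)
NONCACHEABLE_BASE = 0x8000000000  # start of the second half (bit 39 set)
CBOX_IPR_BASE = 0xFFFFF00000      # start of the 1 MByte Cbox IPR region


def classify_pa(pa: int) -> str:
    """Return the name of the physical memory region containing pa."""
    if not 0 <= pa <= PA_MAX:
        raise ValueError("address is outside the 40-bit physical address space")
    if pa < NONCACHEABLE_BASE:
        return "memory-like"
    if pa < CBOX_IPR_BASE:
        return "non-cacheable"
    return "Cbox IPR"
```

Note that both the non-cacheable region and the Cbox IPR region have bit 39 set, so a bit-39 test alone separates memory-like space from everything else.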


4.1.3 Possible Configurations 


The DECchip 21164-AA cache system allows for several system configurations. They can be
broken into two classes: those that use the write invalidate cache coherence protocol and those
that use the flush based protocol. Table 4-2 shows the components that would make up the
system designs that are possible with DECchip 21164-AA.

Table 4-2: System Designs

System Type       Scache  Duplicate    Bcache  Duplicate    Lock Reg.
                          Scache Tag           Bcache Tag
Write Invalidate  Yes     No           No      No           No
Write Invalidate  Yes     Yes          No      No           Required
Write Invalidate  Yes     No           Yes     Required     Required
Flush             Yes     No           No      No           No
Flush             Yes     No           Yes     No           No
Flush             Yes     No           Yes     Yes          Required


In a write invalidate based design, DECchip 21164-AA will expect the SYSTEM to use the READ
DIRTY, READ DIRTY/INVALIDATE, INVALIDATE, and SET SHARED commands to keep the
state of each block up to date.


In a flush based design, DECchip 21164-AA will expect the READ and FLUSH commands to be 
used to remove blocks from the cache. 


4.1.4 Maintaining Cache Coherence 


In a coherent design, DECchip 21164-AA requires the SYSTEM to have the properties described
in this section.


DECchip 21164-AA requires the SYSTEM to allow only one change to a block at a time. This means 
that if DECchip 21164-AA wins the bus to read or write a block, no other node on the bus will be 
allowed to access that block until the data has been moved. 


If DECchip 21164-AA attempts to write a clean/private block of memory, it will send a SET 
DIRTY command to the SYSTEM. At the same time the SYSTEM might be sending a SET SHARED 
or INVALIDATE command to DECchip 21164-AA for the same block. The bus is the coherence 
point in the SYSTEM, so if the bus has already changed the state of the block to shared, setting the 
dirty bit is the wrong thing to do. DECchip 21164-AA will not resend the SET DIRTY command 
when the ownership of the ADDRESS/CMD bus is returned. The write will be restarted and use 
the new tag state to generate a new system request. 


It is also possible for the SYSTEM to send an INVALIDATE at the same time DECchip 21164-AA 
is attempting to do a WRITE BLOCK or WRITE BLOCK LOCK. In this case DECchip 21164- 
AA will abort the WRITE BLOCK transaction, service the INVALIDATE, and then restart the 
WRITE BLOCK transaction. 


In both of these cases, if the SET DIRTY or WRITE BLOCK is started by DECchip 21164-AA and
then interrupted by the SYSTEM, DECchip 21164-AA will resume the same transaction unless the
SYSTEM request was to the same block as the request DECchip 21164-AA had started. In that
case the DECchip 21164-AA request will be restarted internally by the CPU, and it is unpredictable
what transaction DECchip 21164-AA will next present to the system.


DECchip 21164-AA will maintain the processor's Dcache as a subset of the Scache. If a Bcache is
present, the Scache will be maintained as a subset of the Bcache.


The processor's Icache is not a subset of any cache and is incoherent with the rest of the cache
system.




DEC Chip 21164-AA (EV5 CPU) Specification, Revision 1.9, December 1992 


4.1.5 Cache State 
The following tables describe the DECchip 21164-AA multiprocessor cache coherence protocol,
a modification of the protocol described in the Laser System Bus Specification Revision 1.2.
DECchip 21164-AA will not take an update to a shared block; the block will always be invalidated.


Table 4-3: Cache States 


V  S  D  State of cache line assuming tag match
0  X  X  Not valid
1  0  0  Valid for read or write. This cache line contains the only cached copy of
         the block and the copy in memory is identical to this line.
1  0  1  Valid for read or write. This cache line contains the only cached copy of
         the block. The contents of the block have been modified more recently than
         the copy in memory.
1  1  0  Valid for read or write, but writes must be broadcast on the bus. This block
         MAY be in some other CPU's cache.
1  1  1  Valid for read or write, but writes must be broadcast on the bus. This block
         MAY be in some other CPU's cache. The contents of the block have been
         modified more recently than the copy in memory.
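
The V, S, and D bits in Table 4-3 can be decoded as in the sketch below. This is only an illustration of the table; the function names and the returned strings are hypothetical.

```python
def describe_state(v: int, s: int, d: int) -> str:
    """Summarize a cache line state from its Valid, Shared, and Dirty bits."""
    if not v:
        return "invalid"  # S and D are don't-care when V is clear
    sharing = "shared" if s else "private"
    freshness = "dirty" if d else "clean"
    return sharing + ", " + freshness


def write_must_be_broadcast(v: int, s: int, d: int) -> bool:
    """True when a write to this line must be broadcast on the bus (V and S set)."""
    return bool(v and s)
```

A line with V=1, S=1, D=0, for example, decodes as "shared, clean", and a write to it must be broadcast.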


Table 4-4: System Actions

System Command  Tag Probe Results  Bus Response     New Cache State  Comments
Read            Miss               ~Shared, ~Dirty  No change
Rd_ex           Miss               ~Shared, ~Dirty  No change
Write           Miss               ~Shared, ~Dirty  No change
Read            Hit, ~Dirty        Shared, ~Dirty   Shared, ~Dirty
Read            Hit, Dirty         Shared, Dirty    Shared, Dirty    This cache supplies the data
Rd_ex           Hit, ~Dirty        ~Shared, ~Dirty  Invalid
Rd_ex           Hit, Dirty         ~Shared, Dirty   Invalid          This cache supplies the data
Write           Hit                ~Shared, ~Dirty  Invalid

Table 4-5: Processor Actions

Processor  Cache Probe           DECchip 21164-AA       Bus
Command    Results               ADDR CMD               Response  New Cache State
Read       Invalid               Read Miss              ~Shared   ~Shared, ~Dirty
Read       Invalid               Read Miss              Shared    Shared, ~Dirty
Write      Invalid               Read Miss Mod          ~Shared   ~Shared, Dirty
Read       Miss, ~Dirty          Read Miss              ~Shared   ~Shared, ~Dirty
Read       Miss, ~Dirty          Read Miss              Shared    Shared, ~Dirty
Write      Miss, ~Dirty          Read Miss Mod          ~Shared   ~Shared, Dirty
Read       Miss, Dirty           Victim, Read Miss      ~Shared   ~Shared, ~Dirty
Read       Miss, Dirty           Victim, Read Miss      Shared    Shared, ~Dirty
Write      Miss, Dirty           Victim, Read Miss Mod  ~Shared   ~Shared, Dirty
Read       Hit                   NOP                    NOP       No change
Write      Hit, Dirty, ~Shared   NOP                    NOP       ~Shared, Dirty
Write      Hit, Dirty, Shared    Write Block            ~Shared   ~Shared, ~Dirty
Write      Hit, ~Dirty, ~Shared  Set Dirty              NOP       ~Shared, Dirty
Write      Hit, ~Dirty, Shared   Write Block            ~Shared   ~Shared, ~Dirty
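
The two action tables can be read as lookup functions. The sketch below is one way to express the rows of Table 4-4 and the write-hit rows of Table 4-5; it is an illustration of the tables only, and the function names, argument names, and string values are hypothetical.

```python
def system_action(command: str, hit: bool, dirty: bool = False):
    """Table 4-4: return (new_cache_state, this_cache_supplies_data)."""
    if command not in ("Read", "Rd_ex", "Write"):
        raise ValueError("unknown system command: " + command)
    if not hit:
        return ("No change", False)
    if command == "Read":
        # A read hit leaves the block shared; a dirty block is supplied by this cache.
        return ("Shared, Dirty" if dirty else "Shared, ~Dirty", dirty)
    if command == "Rd_ex":
        # An exclusive read invalidates the block; a dirty block is supplied first.
        return ("Invalid", dirty)
    return ("Invalid", False)  # Write hit: the block is invalidated


def write_hit_action(dirty: bool, shared: bool):
    """Table 4-5, write-hit rows: return (addr_cmd, new_cache_state)."""
    if shared:
        # Writes to shared blocks are always sent to the bus.
        return ("Write Block", "~Shared, ~Dirty")
    if dirty:
        return ("NOP", "~Shared, Dirty")    # already dirty and private
    return ("Set Dirty", "~Shared, Dirty")  # clean, private block
```

For example, a processor write that hits a clean, private block produces a SET DIRTY command, while the same write to a shared block produces a WRITE BLOCK.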


If DECchip 21164-AA requests a READ MISS MOD, DECchip 21164-AA expects the block to be
returned ~shared, dirty. However, if the system returns the data shared, ~dirty, DECchip 21164-AA
will follow with a WRITE BLOCK command. Doing this might expose the system to livelock
problems.


4.1.6 DECchip 21164-AA Interface 


The interface can be divided into two parts: the SYSTEM interface and the Bcache interface. Both
parts share the data bus.


The SYSTEM interface is made up of a bi-directional command and address bus, and several
control signals. They are described in Section 4.1.6.1. The Bcache interface signals are described
in Section 4.1.6.2.

4.1.6.1 System Interface


These are the signals that make up the SYSTEM interface. All are driven and received by DECchip
21164-AA on the rising edge of Sysclock.


• ADDR_H<39:4>
Bi-directional
This is the address of the requested data or operation. If bit 39 is asserted, the reference is
to non-cached memory.

• CMD_H<3:0>
Bi-directional
Table 4-6 lists the encodings for the commands that DECchip 21164-AA can drive on the
CMD bus. Optional commands can be disabled in systems that do not require them. It is
unpredictable whether DECchip 21164-AA will drive a disabled command to the SYSTEM; however,
no CACK should ever be sent for a disabled command.


Table 4-6: DECchip 21164-AA Commands to the System

CMD<3:0>  Command           Optional  Comments
0000      NOP               No        Nothing
0001      LOCK              Yes       New lock register address
0010      FETCH             No        DECchip 21164-AA passing a FETCH to the system
0011      FETCH_M           No        DECchip 21164-AA passing a FETCH_M to the system
0100      MEMORY BARRIER    Yes       MB instruction
0101      SET DIRTY         Yes       Dirty bit will be set if shared is still clear
0110                                  spare
0111                                  spare
1000      READ MISS0        No        Request for data
1001      READ MISS1        No        Request for data
1010      READ MISS MOD0    No        Request for data, modify intent
1011      READ MISS MOD1    No        Request for data, modify intent
1100      BCACHE VICTIM     No        Bcache victim should be removed
1101                                  spare
1110      WRITE BLOCK       No        Request to write a block
1111      WRITE BLOCK LOCK  No        Request to write a block with lock


Table 4-7 lists the encodings for the commands that the SYSTEM can drive on the CMD bus.

Table 4-7: System Commands to DECchip 21164-AA

CMD<3:0>  Command         Comments
0000      NOP             Nothing
0001      FLUSH           Remove block from caches, return dirty data
0010      INVALIDATE      Remove the block
0011      SET SHARED      Block goes to the shared state
0100      READ            Read a block
0101      READ DIRTY      Read a block, set shared
0111      READ DIRTY/INV  Read a block, invalidate


• ADDR_CMD_PAR_H
Bi-directional
This is the odd parity on the current command and address bus. DECchip 21164-AA will take
a machine check if a parity error is detected. The SYSTEM should do the same if it detects an
error.

• VICTIM_PENDING_H
Output
Indicates that the current read miss has generated a victim. Systems may want to hold off
requesting the command/address bus until the victim has been removed.

• ADDR_BUS_REQ_H
Input
If this signal is asserted before the rising edge of a Sysclock, DECchip 21164-AA will not drive
the ADDRESS or CMD busses during the next cycle.

• CACK_H
Input
If this signal is asserted before the rising edge of a Sysclock, DECchip 21164-AA will drive
the next address and command during the next cycle.

• CFAIL_H
Input
CFAIL has two uses. It should be used during the CACK cycle of a WRITE BLOCK LOCK
command to indicate that the write has failed. It can also be used in cycles where CACK is
not asserted to force an Ibox timeout event, which in turn causes a partial reset of DECchip
21164-AA and a trap to the MCHK PALcode entry point.

• RES_H<1:0>
Output
Table 4-8 lists the encoding of DECchip 21164-AA responses to SYSTEM requests.

Table 4-8: DECchip 21164-AA Responses to System Commands

RES<1:0>  Command     Comments
00        NOP         Nothing
01        NOACK       Data not found or clean
10        ACK/Scache  Data from Scache
11        ACK/Bcache  Data from Bcache


• INT4_VALID_H<3:0>
Output
During writes, these wires indicate which INT4s of data are valid. This is useful
for non-cached writes that have been merged in the write buffer. During reads, these wires
indicate which INT8s of a 32-byte block need to be read and returned to the processor.
This is useful for reads to non-cached memory.

• SCACHE_SET_H<1:0>
Output
During a read miss request, these pins indicate the Scache set number that will be
filled when the data is returned. This information can be used by the SYSTEM to maintain a
duplicate copy of the Scache tag store.

• FILL_H
Input
If this signal is asserted in Sysclock N, DECchip 21164-AA will provide the address indicated
by FILL_ID to the Bcache in Sysclock N+2. The Bcache will begin to write in that Sysclock.
At the end of the write, DECchip 21164-AA will wait for the next Sysclock and then begin
the write again (it may take more than one Sysclock to write the Bcache).

• FILL_ID_H
Input
If this signal is asserted in Sysclock N, DECchip 21164-AA will provide the address from miss
register one. If it is deasserted, the address in miss register zero will be used for the fill.

• FILL_ERROR_H
Input
If this signal is asserted while a fill is pending from memory, it indicates to DECchip
21164-AA that the system has detected an invalid address or hard error. The system will still provide
an apparently normal fill sequence with correct ECC/parity, though the data is not valid.
DECchip 21164-AA will trap to the MCHK PALcode entry point.

• DACK_H
Input
For fills, if this signal is asserted before the Sysclock edge, it indicates to DECchip 21164-AA
that fill data was valid in that Sysclock and that DECchip 21164-AA should switch to the next
address at the Sysclock edge.
For writes, if this signal is asserted before the Sysclock edge, it indicates that DECchip
21164-AA should provide the next address and data at the Sysclock edge.

• FILL_NOCHECK_H
Input
Do not check the parity or ECC for the current data cycle on a fill.

• SYSTEM_LOCK_FLAG_H
Input
This wire indicates the state of the system lock flag. During fills, DECchip 21164-AA will
AND the value of the system copy with its own copy to produce the true value of the lock flag.

• IDLE_BC_H
Input
When this wire is asserted, DECchip 21164-AA will finish the current Bcache read or write.
The CPU will not be allowed to start a new read or write until the wire is deasserted. Systems
must assert this wire in time to idle the Bcache before a fill arrives. It can also be used to
improve the response time of DECchip 21164-AA to SYSTEM requests.
The time required to idle the Bcache is a function of the internal design of DECchip 21164-AA,
the block size, the read and write speed of the Bcache, the amount of tri-state overlap that
must be avoided, and the Sysclock ratio. Take the larger of:

read idle = 3 + (block_size/16)*BC_RD_SPD + tri-state_ram_turn_off

or

write idle = 5 + (block_size/16)*BC_WRT_SPD + tri-state_cpu_turn_off

divide by the Sysclock ratio, and round up to the next Sysclock value. This is the number of
Sysclocks required between DECchip 21164-AA receiving IDLE_BC and the Bcache being idle.
For example, if the Sysclock ratio is 6, BC_RD_SPD is 4, BC_WRT_SPD is 5, the block size is
32 bytes, two idle CPU cycles are required to turn off the RAM drivers for reads, and zero
are required to turn off DECchip 21164-AA's write drivers, then it will take max(3+2*4+2,
5+2*5+0)/6 = 15/6 = 2.5, which rounds up to 3 Sysclocks, to idle the cache. If IDLE_BC is
asserted in Sysclock N, then the first fill data could be written in Sysclock N+3.
For FILL requests, IDLE_BC can be deasserted any time after the fill starts.

• DATA_BUS_REQ_H
Input
If this signal is asserted in Sysclock N, DECchip 21164-AA will not drive the data bus in
Sysclock N+2. Before asserting this signal the system should assert IDLE_BC for the correct
number of cycles. If this signal is deasserted in Sysclock N, DECchip 21164-AA will drive the
data bus in Sysclock N+2.
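
The IDLE_BC timing rule given above (take the larger of the read idle and write idle times, then round up to whole Sysclocks) can be checked with a short calculation. The following is a minimal sketch; the function and parameter names are hypothetical, and all times are in CPU cycles except the result.

```python
def idle_sysclocks(block_size: int, bc_rd_spd: int, bc_wrt_spd: int,
                   ram_turn_off: int, cpu_turn_off: int,
                   sysclock_ratio: int) -> int:
    """Return the number of Sysclocks needed to idle the Bcache after IDLE_BC."""
    chunks = block_size // 16  # number of INT16 transfers per block
    read_idle = 3 + chunks * bc_rd_spd + ram_turn_off
    write_idle = 5 + chunks * bc_wrt_spd + cpu_turn_off
    cpu_cycles = max(read_idle, write_idle)
    # Ceiling division: round up to a whole number of Sysclocks.
    return -(-cpu_cycles // sysclock_ratio)
```

With the values from the example in the text (32-byte block, BC_RD_SPD 4, BC_WRT_SPD 5, two RAM turn-off cycles, zero CPU turn-off cycles, Sysclock ratio 6), the worst case is 15 CPU cycles, which rounds up to 3 Sysclocks.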


4.1.6.2 Bcache Interface 


These signals make up the Bcache interface. Reads and writes of the Bcache that do not involve
the SYSTEM will begin on any CPU clock. If the Bcache read or write involves receiving or sending
data to the SYSTEM, then the access will begin on a rising Sysclock edge.

• INDEX_H<25:4>
Output
These wires are used to index the Bcache.

• DATA_H<127:0>
Bi-directional
This bus is used to move data between DECchip 21164-AA, the Bcache, and the SYSTEM.

• DATA_CHECK_H<15:0>
Bi-directional
Either even byte parity or INT8 ECC for the current data cycle.

• TAG_DATA_H<38:22>
Bi-directional
Bcache tag data bits. This allows for Bcaches in the 4 MB to 64 MB range.

• TAG_DATA_PAR_H
Bi-directional
Odd parity for TAG_DATA_H<38:22>. The SYSTEM should force unused bits to zero.

• TAG_VALID_H
Bi-directional
The current tag contains a valid block. DECchip 21164-AA will assert this pin during fills.

• TAG_SHARED_H
Bi-directional
The block is in the shared state. During fills the SYSTEM should drive TAG_SHARED_H with
the correct value.

• TAG_DIRTY_H
Bi-directional
The block is in the dirty state. During fills the SYSTEM should assert this bit if the DECchip
21164-AA request was a READ MISS MOD and the shared bit is not asserted.

• TAG_CTL_PAR_H
Bi-directional
Odd parity for TAG_VALID_H, TAG_SHARED_H, and TAG_DIRTY_H. During fills the system
should drive the correct parity based on the state of the V, S, and D bits.

• TAG_RAM_OE_H
Output
This signal will be asserted by DECchip 21164-AA during any Bcache read.

• TAG_RAM_WE_H
Output
This signal will be asserted by DECchip 21164-AA, using the write pulse register contents,
during any tag write. During the first CPU cycle of a write, the write pulse will be deasserted.
In the second and following CPU cycles of the write, the write pulse will be asserted if the
corresponding bit in the write pulse register is asserted.

• DATA_RAM_OE_H
Output
This signal will be asserted by DECchip 21164-AA during any Bcache read.

• DATA_RAM_WE_H
Output
This signal will be asserted by DECchip 21164-AA, using the write pulse register contents,
during any data write. During the first CPU cycle of a write, the write pulse will be
deasserted. In the second and following CPU cycles of the write, the write pulse will be asserted
if the corresponding bit in the write pulse register is asserted.
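
Several of the signals above carry odd parity (for example, TAG_CTL_PAR_H over the V, S, and D bits). The sketch below computes such a bit, assuming the conventional meaning of odd parity: the parity bit is chosen so that the total number of ones, including the parity bit itself, is odd. The function name is hypothetical.

```python
def odd_parity_bit(*bits: int) -> int:
    """Return the parity bit that makes the total count of ones odd."""
    ones = sum(1 for b in bits if b)
    # If the data already has an odd number of ones, the parity bit is 0.
    return 0 if ones % 2 == 1 else 1


def tag_ctl_parity(valid: int, shared: int, dirty: int) -> int:
    """Parity a system would drive on TAG_CTL_PAR_H during a fill (sketch)."""
    return odd_parity_bit(valid, shared, dirty)
```

For a fill that returns a valid, private, dirty block (V=1, S=0, D=1), the data has two ones, so the parity bit is 1.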
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4.1.7 




DECchip 21164-AA Interface Command Descriptions 


• FETCH/FETCH_M
From DECchip 21164-AA
These commands are issued by DECchip 21164-AA when the FETCH and FETCH_M instructions
are executed.

• FLUSH
From SYSTEM
The FLUSH command will cause a block to be removed from the DECchip 21164-AA cache
system. If the block is not found, DECchip 21164-AA will respond with NOACK. If the block
is found and the block is clean, DECchip 21164-AA will respond with NOACK. The block will
be invalidated in the Dcache, Scache, and Bcache. If the block is found and dirty, DECchip
21164-AA will respond with ACK/Scache or ACK/Bcache. If the data is found dirty in the
Scache, it will be driven at the pins in the same Sysclock as the ACK/Scache. If the data is
found dirty in the Bcache, the Bcache read will start in the same Sysclock as the ACK. The block
will be invalidated in the Dcache, Scache, and Bcache.

• LOCK
From DECchip 21164-AA
This command is used to load the SYSTEM lock register. The state of the SYSTEM lock register
flag is used on each fill to update the DECchip 21164-AA copy of the lock flag. See
Section 4.1.8.12 for details.

• MEMORY BARRIER
From DECchip 21164-AA
This command is issued by DECchip 21164-AA to synchronize read and write accesses with
other processors in the SYSTEM. DECchip 21164-AA issues this command when an MB instruction
is executed. DECchip 21164-AA will stop issuing memory reference instructions and wait
for the command to be acknowledged before continuing.

• NOP
From DECchip 21164-AA or SYSTEM
Nothing. This command should be driven by the owner of the CMD bus if it has nothing to
do.

• READ
From SYSTEM
The READ command will probe the Scache and Bcache to see if the requested block is present.
If the block is present, DECchip 21164-AA will respond with ACK/Scache or ACK/Bcache. If
the data is in the Scache, the data will be driven on the DATA bus in the same Sysclock as the
ACK. If the data is in the Bcache, a Bcache read will begin in the same Sysclock as the ACK.
If the block is not present in either cache, DECchip 21164-AA will assert NOACK on the RES
wires.

• READ DIRTY
From SYSTEM
The READ DIRTY command will probe the Scache to see if the requested block is present
and dirty. If the block is not found, or the block is clean, and the SYSTEM does not contain
a Bcache, DECchip 21164-AA will respond with a NOACK. If the block is found and dirty
in the Scache, DECchip 21164-AA will respond with ACK/Scache and drive the data on the
DATA bus. If the block is not found in the Scache, and the SYSTEM contains a Bcache, it is
assumed to be in the Bcache. DECchip 21164-AA will respond with ACK/Bcache, index the
Bcache to read the block, and change the block status to the shared dirty state.

• READ DIRTY INVALIDATE
From SYSTEM
This command is identical to the READ DIRTY command except that if the block is present it will
be invalidated from the caches.

• READ MISSn
From DECchip 21164-AA
This command is used to indicate that DECchip 21164-AA has probed its caches and that the
addressed block was not present.

• READ MISS MODIFYn
From DECchip 21164-AA
This command is used to indicate that DECchip 21164-AA plans to write to the returned
cache block. Normally the dirty bit should be set when the tag status is returned to DECchip
21164-AA.

• SET SHARED
From SYSTEM
The SET SHARED command is used by the SYSTEM to change the state of a block in the cache
system to shared. The shared bit in the Scache will be set if the block is present. The Bcache
tag will be written to the shared not dirty state. DECchip 21164-AA assumes that this is acceptable
because the SYSTEM would have sent a READ DIRTY if the dirty bit were set.
If the block is found in the Scache, DECchip 21164-AA will respond with ACK/Scache.
Otherwise, if the SYSTEM contains a Bcache, the block is assumed to be in the Bcache and
DECchip 21164-AA will respond with ACK/Bcache. If the SYSTEM does not contain a Bcache
and the block is not found in the Scache, DECchip 21164-AA will respond with a NOACK.

• SET DIRTY
From DECchip 21164-AA
DECchip 21164-AA wants to write a clean, private block in its Scache and wants the dirty
bit set in the duplicate tag store. The CPU will not proceed with the write until a CACK
response is received from the SYSTEM. When the CACK is received, DECchip 21164-AA will
attempt to set the dirty bit. If the shared bit is still clear, the dirty bit will be set and the
write completed. If the shared bit is set, the dirty bit will not be set, and DECchip 21164-AA
will request a WRITE BLOCK. The copy of the dirty bit in the Bcache will not be updated
until the block is removed from the Scache.

• INVALIDATE
From SYSTEM
DECchip 21164-AA will probe the Scache and invalidate the block if it is present. If the
Bcache is present, the block will be changed to the invalid state without probing.
If the block is found in the Scache, DECchip 21164-AA will respond with ACK/Scache.
Otherwise, if the SYSTEM contains a Bcache, the block is assumed to be in the Bcache and
DECchip 21164-AA will respond with ACK/Bcache. If the SYSTEM does not contain a Bcache
and the block is not found in the Scache, DECchip 21164-AA will respond with a NOACK.

• BCACHE VICTIM
From DECchip 21164-AA
If there is a victim buffer in the SYSTEM, this command is used to pass the address of the victim
to the SYSTEM. The read miss that produced the victim will precede the BCACHE VICTIM
command. The VICTIM_PENDING wire will be asserted during the read miss command to
indicate that a BCACHE VICTIM command is waiting, and that the Bcache is starting the read
of the victim data.
If the SYSTEM does not have a victim buffer, the BCACHE VICTIM command will precede
the read miss commands. The BCACHE VICTIM command will be driven, along with the
address of the victim. At the same time the Bcache will be read to provide the victim data.

• WRITE BLOCK
From DECchip 21164-AA
DECchip 21164-AA wants to write a block of data back to memory. DECchip 21164-AA will
drive the command, address, and first INT16 of data on a Sysclock edge. DECchip 21164-AA
will output the next INT16 of data when a DACK is received. When the SYSTEM asserts
CACK, DECchip 21164-AA will remove the command and address from the pins and begin
the write of the Scache. CACK can be asserted before all the data is removed.

• WRITE BLOCK LOCK
From DECchip 21164-AA
This command is the same as a WRITE BLOCK except that a CFAIL may be asserted by the
SYSTEM to indicate that the data cannot be written. This command is only used for STx_C in
non-cached space.
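
The response rule shared by the SET SHARED and INVALIDATE descriptions above can be summarized in a few lines. The sketch below uses the RES<1:0> encodings from Table 4-8; the function and argument names are hypothetical, and it covers only these two commands.

```python
# RES<1:0> encodings from Table 4-8
RES_NOP = 0b00
RES_NOACK = 0b01
RES_ACK_SCACHE = 0b10
RES_ACK_BCACHE = 0b11


def probe_response(found_in_scache: bool, bcache_present: bool) -> int:
    """Return the RES<1:0> value for a SET SHARED or INVALIDATE command."""
    if found_in_scache:
        return RES_ACK_SCACHE
    if bcache_present:
        # The block is assumed to be in the Bcache; no Bcache probe is made.
        return RES_ACK_BCACHE
    return RES_NOACK
```

Note that in a system with a Bcache these two commands never produce NOACK, because a block missing from the Scache is assumed to reside in the Bcache.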


4.1.8 Transactions 


This section will describe how the commands are used to move data in and out of DECchip 
21164-AA and its cache system. 


Figure 4-1 shows the resources that can be used by the CPU and SYSTEM. They are listed here. 


• 2 CPU commands and addresses
• 2 Scache victim addresses
• 2 System command and address


4.1.8.1 Read Miss 


DECchip 21164-AA will start a Bcache read on any CPU clock. The index will be asserted to the
RAM for a programmable number of CPU cycles in the range of 4 to 10. The tag will be accessed
at the same time. At the end of the first read, DECchip 21164-AA will latch the data and tag
information and begin the read of the next 16 bytes of data. The tag will be checked for a hit.
If there is a miss, a READ MISS or READ MISS MOD command along with the address will
be queued to the CMD/ADDRESS bus. It will appear on the pins at the next Sysclock edge.
Figure 4-2 shows the timing of a Bcache read and the resulting READ MISS request.


Figure 4-2 shows the READ MISS command being CACKed as soon as it is sent. This will allow
DECchip 21164-AA to make additional READ MISS requests. It is also possible for the SYSTEM 
to defer the CACK until the fill data is returned. This allows the SYSTEM to use CMD<0> for the 
value of FILL_ID. The CACK should arrive no later than the last fill DACK. 


Figure 4-2: Read Miss


4.1.8.2 Read Miss with Victim


DECchip 21164-AA supports two models for removing displaced dirty blocks from the Bcache.
The first assumes that the SYSTEM does not contain a victim buffer. In this case the victim must
be read from the Bcache before the new block can be requested. In the second case, if the SYSTEM
does have a victim buffer, DECchip 21164-AA will request the new block from memory while it
starts to read the victim from the Bcache. The victim command and address will follow the miss
request.


In either case, DECchip 21164-AA treats a miss/victim as a single transaction. If the assertion
of ADDR_BUS_REQ or IDLE_BC causes the BIU sequencer to reset, both the miss and victim
transactions will be restarted from the beginning. For example, if DECchip 21164-AA is operating
in victim first mode and it sends a BCACHE VICTIM command to the SYSTEM and then the SYSTEM
sends an INVALIDATE to DECchip 21164-AA, DECchip 21164-AA will restart the Bcache read
and resend the BCACHE VICTIM command and data and then the READ_MISS.


The next two sections describe each of these methods of victim processing. 


4.1.8.2.1. Without a Victim Buffer 


If the SYSTEM does not contain a victim buffer, DECchip 21164-AA will stop reading the Bcache as
soon as the miss is detected. This will be sometime during the second read. A BCACHE VICTIM 
command will be asserted at the next Sysclock along with the victim address. A Bcache read of 
the victim will also be started at the Sysclock edge. When the DACK is received for the first part 
of the victim, DECchip 21164-AA will begin reading the next part of the victim. CACK can be 
sent anytime during the processing of the victim. DECchip 21164-AA will send out the READ 
MISS command in the Sysclock after the CACK is received. Figure 4-3 shows the timing of a 
victim being removed. 


Figure 4-3: Read Miss with Victim


4.1.8.2.2 With a Victim Buffer


When the miss is detected, if the SYSTEM has a victim buffer, DECchip 21164-AA will wait for the
next Sysclock edge and then assert a READ MISS command, the read miss address, and the
VICTIM_PENDING wire, and index the Bcache to begin the read of the victim. When the SYSTEM asserts
CACK, DECchip 21164-AA will send out the BCACHE VICTIM command along with the victim
address. Each assertion of DACK will cause the Bcache index to advance to the next part of the
block. Figure 4-4 shows the timing of a read miss with a victim.


Figure 4-4: Read Miss with Victim Buffer 
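

The command ordering for the two victim models in Sections 4.1.8.2.1 and 4.1.8.2.2 can be summarized in a short sketch. This is an illustration only: the function name and the string labels are hypothetical, and the sketch ignores timing, DACK handling, and restarts.

```python
def miss_with_victim_commands(system_has_victim_buffer: bool) -> list[str]:
    """Order in which commands appear on the CMD/ADDRESS bus for a miss with a victim.

    Hypothetical helper: it only captures the ordering stated in the text.
    """
    if system_has_victim_buffer:
        # READ MISS goes first with VICTIM_PENDING asserted; the victim
        # command follows once the SYSTEM has asserted CACK.
        return ["READ MISS (VICTIM_PENDING asserted)", "BCACHE VICTIM"]
    # Victim first: the READ MISS is sent in the Sysclock after CACK is received.
    return ["BCACHE VICTIM", "READ MISS"]


print(miss_with_victim_commands(True))
print(miss_with_victim_commands(False))
```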


4.1.8.3 Fill


The fill wires are used to control the return of fill data to DECchip 21164-AA and the Bcache if
it is present. The IDLE_BC_H wire must be used to stop CPU requests in the Bcache in such a
way that the Bcache will be idle when the fill data arrives (but not the fill command). FILL_H
should be asserted at least two Sysclocks before the fill data arrives. The FILL_ID_H wire should
be asserted at the same time to indicate if the fill will be for a READ MISS0 or READ MISS1.
DECchip 21164-AA will use this information to select the correct fill address. If FILL and
FILL_ID are asserted at the end of Sysclock N, then DECchip 21164-AA will assert the Bcache index
and begin a Bcache write during Sysclock N+2. The SYSTEM should drive the data onto the DATA
bus and assert DACK before the end of the Sysclock cycle. This will cause DECchip 21164-AA
to move on to the next fill address and begin another write of the Bcache. The SYSTEM must
allocate the right number of Sysclock cycles to allow the writing of the Bcache if it is present.
For example, if the Bcache requires 17 ns to write and the Sysclock is 12 ns, two Sysclock cycles
will be required for each write.
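

The arithmetic behind this example can be sketched as follows. The sketch assumes that the number of Sysclock cycles per Bcache write is simply the write time divided by the Sysclock period, rounded up; the function name `sysclocks_per_write` is hypothetical and not part of this specification.

```python
import math


def sysclocks_per_write(bcache_write_ns: float, sysclock_ns: float) -> int:
    """Sysclock cycles the SYSTEM allots to one Bcache write (assumed: round up)."""
    if bcache_write_ns <= 0 or sysclock_ns <= 0:
        raise ValueError("times must be positive")
    return math.ceil(bcache_write_ns / sysclock_ns)


# The example from the text: a 17 ns write with a 12 ns Sysclock.
print(sysclocks_per_write(17, 12))
```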


During the first fill of a block, the SYSTEM should also drive the correct values on the
TAG_SHARED, TAG_DIRTY, and TAG_PARITY wires. DECchip 21164-AA will assert TAG_VALID
and write the Bcache tag store during the first fill.


Figure 4-5: Fill 


4.1.8.4 Write Block


The WRITE BLOCK command will be used to complete writes to shared data, to remove Scache
victims in Bcache-less systems, and to complete writes to non-cached memory.


The WRITE BLOCK LOCK command follows the same protocol. The LOCK qualifier might allow 
the SYSTEM to be more aggressive on non-interlocked writes. 


DECchip 21164-AA will assert the WRITE BLOCK command along with the address and the first 
16 bytes of data at the start of a Sysclock. If the SYSTEM takes away the ownership of the CMD 
and ADDRESS bus, DECchip 21164-AA will hold on to the write and wait for the ownership of 
the bus to be returned. If the block in question is invalidated, the write will be restarted by the 
CPU and will result in the READ MISS MOD request instead. 


When the SYSTEM has taken the first part of the data it should assert DACK. This will cause 
DECchip 21164-AA to drive the next 16 bytes of data at the next Sysclock edge. 


If the SYSTEM asserts CACK, DECchip 21164-AA will output the next command in the next 
Sysclock. Receiving the CACK indicates to DECchip 21164-AA that the write will be taken and 
that it is safe to update the Scache with write data. 


DEC Chip 21164-AA (EV5 CPU) Specification, Revision 1.9, December 1992 


During each cycle the INT4_VALID_H<3:0> wires will indicate which INT4 parts of the write are 
really being written by the processor. For writes to cached memory, all of the data will be valid. 
For writes to non-cached memory, only those INT4 with the INT4_VALID_H<n> signal asserted 
are valid. 
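

As an illustration of how a SYSTEM might interpret the valid mask on a non-cached write, the sketch below maps a 4-bit INT4_VALID_H<3:0> value to byte ranges within the 16-byte data cycle. It assumes that bit n covers bytes 4n through 4n+3; that bit-to-byte ordering and the function name are assumptions made for the example, not statements from this specification.

```python
def valid_int4_byte_ranges(int4_valid: int) -> list[tuple[int, int]]:
    """Return (first_byte, last_byte) for each INT4 flagged valid in a 4-bit mask.

    Assumption for illustration: bit n of the mask covers bytes 4n..4n+3
    of the 16-byte (INT16) data cycle.
    """
    if not 0 <= int4_valid <= 0xF:
        raise ValueError("mask must fit in 4 bits")
    return [(4 * n, 4 * n + 3) for n in range(4) if int4_valid & (1 << n)]


print(valid_int4_byte_ranges(0b1111))  # cached write: all four INT4s valid
print(valid_int4_byte_ranges(0b0101))  # non-cached write: only INT4 0 and INT4 2
```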


Figure 4-6 shows the timing of a write block command.


Figure 4-6: Write Block


4.1.8.5 Set Dirty, Lock


Figure 4-7 shows the timing of a SET DIRTY command and a LOCK command.


The SET DIRTY command is used by DECchip 21164-AA to inform a duplicate tag store that a 
cached block is changing from the not-shared clean state to the not-shared dirty state. When the 
CACK is received from the SYSTEM, DECchip 21164-AA will attempt to set the dirty bit. If the 
shared bit has been set since the original probe of the Scache, or the block has been invalidated, 
DECchip 21164-AA will restart the write. This will produce a new request which reflects the new 
state of the block. If the block is still in the not-shared clean state, the dirty bit will be set and 
the write completed. 


The LOCK command is used by DECchip 21164-AA to pass the address of a LDx_L to the SYSTEM. 
A system lock register is required in any system that filters write traffic with a duplicate tag store. 
If the locked block is displaced from the DECchip 21164-AA caches, DECchip 21164-AA will use 
the value of the system lock register to determine if the LDx_L/STx_C sequence should pass or 
fail. 


Figure 4-7: Set Dirty, and Lock


4.1.8.6 Flush


The FLUSH command can be used to remove blocks from the DECchip 21164-AA cache system. 
If the block is dirty, the block will be read from the caches to allow the updating of memory. 
Figure 4-8 shows the timing of a FLUSH transaction. 


Figure 4-8: Flush 


4.1.8.7 Read Dirty, and Read Dirty/INV


The READ DIRTY command is used to read modified data from the cache system. The block is also
transitioned into the shared state. Figure 4-9 shows the timing of a READ_DIRTY transaction.
The Scache will be probed and the data read if it is found. The state will also be set to shared.
If the data is not found in the Scache, it is assumed to be in the Bcache. DECchip 21164-AA will
start the read of the Bcache and write the tag to the shared state.


The READ DIRTY/INV command is identical to the READ DIRTY command except the block is
transitioned to the invalid state instead of the shared state.


Figure 4-9: Read Dirty 


4.1.8.8 Invalidate


The INVALIDATE command can be used to remove a block from the cache system. Unlike the
FLUSH command, any modified data will not be read. The Scache will be probed and invalidated
if the block is found. The Bcache will be invalidated without probing. Figure 4-10 shows the
timing of an INVALIDATE transaction.


Figure 4-10: Invalidate 


4.1.8.9 Set Shared


When DECchip 21164-AA receives a SET_SHARED command, it will probe the Scache and change
the state of the block to shared if it is found. DECchip 21164-AA will assume that the block is in
the Bcache and write the state of the tag to shared, not-dirty. Figure 4-11 shows the timing of a
SET_SHARED command.


Figure 4-11: Set Shared 


4.1.8.10 Non-cached Reads


Reads to physical addresses that have bit 39 asserted will not be cached in the Dcache, Scache,
or Bcache. They will be merged like any other read in the miss address file. To prevent several
reads to non-cached memory from being merged into a single 32 byte bus request, software must
insert MB instructions. The miss address file will merge as many Dstream reads together as
it can and send the request to the BIU via the Scache. The BIU will not merge two 32 byte
requests into a single 64 byte request. The BIU will request a READ MISS from the SYSTEM.
DATA_VALID<3:0>_H will indicate which of the four quadwords are being requested by software.
The SYSTEM should return the fill data to DECchip 21164-AA in the normal way. DECchip
21164-AA will not write the Dcache, the Scache, or the Bcache with the refill. The requested data will
be written in the register file or Icache.
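

The address test that selects non-cached space can be written as a one-line check. The sketch below is illustrative; the constant and function names are hypothetical, and only the use of physical address bit 39 comes from the text.

```python
NONCACHED_BIT = 39  # physical address bit that marks non-cached space


def is_noncached(physical_address: int) -> bool:
    """True when bit 39 of the physical address is asserted."""
    if physical_address < 0:
        raise ValueError("address must be non-negative")
    return bool((physical_address >> NONCACHED_BIT) & 1)


print(is_noncached(0x0000_1000))          # cacheable space
print(is_noncached((1 << 39) | 0x1000))   # non-cached space
```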


4.1.8.11 Non-cached Writes 


Writes to physical addresses that have bit 39 asserted will not be written to any of the caches. 
They will be merged in the write buffer before being sent to the SYSTEM. If software does not 
want writes to merge it must insert MB or WMB instructions between them. 


When the write buffer decides to dump data to non-cached memory the BIU will request a WRITE 
BLOCK. Each data cycle, DATA_VALID<3:0> will indicate which INT4s within the INT16 were 
really written. 


4.1.8.12 Locks


The LDx_L instructions will be forced to miss in the Dcache. When the Scache is read, the Lock
register in the BIU will be loaded with the physical address and the lock flag set. The BIU will
send a LOCK command to the SYSTEM so it can load its lock register. The SYSTEM lock register
will only be used if the locked block is displaced from the cache system. The lock flag will be
cleared if any of the following things happen:


• Any write from the bus occurs to the locked block (FLUSH, INVALIDATE, or
  READ_DIRTY_INV).
• A STx_C by the processor.


The SYSTEM copy of the lock register is required on systems that have a duplicate tag store to filter
write traffic. The direct mapped Icache, Dcache, and Bcache along with the sub-setting rules,
branch prediction, and Istream prefetching can cause a lock to always fail because of constant
Scache thrashing of the locked block. Each time a block is loaded into the Scache, the value of the
lock register will be ANDed with the value of the SYSTEM_LOCK_FLAG signal. If the locked
block is displaced from the cache system, DECchip 21164-AA will not see bus writes to the locked
block. In this case the SYSTEM's copy of the lock register will correct the processor copy of the lock
flag, via the SYSTEM_LOCK_FLAG_H signal, when the block is filled into the cache.


Systems that do not have a duplicate tag store, and send all probe traffic to DECchip 21164-AA,
are not required to have a copy of the lock flag. They should wire SYSTEM_LOCK_FLAG_H
to TRUE.


When the STx_C is issued the Ibox will stop issuing memory type instructions. The store will
update the Dcache in the normal way, and be placed in the write buffer by itself. It will not be
merged with other pending writes. The write buffer will be flushed.


When the write buffer gets to a STx_C in cached memory, it will probe the Scache to check the
block state. When the STx_C passes through the Scache, an invalidate will be sent to the Dcache.
If the Lock flag is clear, the STx_C will fail. If the block is not-shared dirty, the write buffer will
write the STx_C data into the Scache. Success will be written to the register file and the Ibox
will begin issuing memory instructions again. If the block is in the shared state, the BIU will
request a WRITE BLOCK LOCK. If the WRITE BLOCK LOCK is CACKed, the Scache will be
written and the Ibox started as above. If the WRITE BLOCK LOCK is CFAILed, the STx_C will
fail. No data will be written.


When the write buffer gets to a STx_C in non-cached memory it will probe the Scache to check 
the block state. It will miss. The state of the Lock flag will be ignored. The BIU will request a 
WRITE BLOCK LOCK. If the WRITE BLOCK LOCK is CACKed, the Ibox is started as above. 
If the WRITE BLOCK LOCK is CFAILed the STx_C will fail. No data will be written. 
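

The pass/fail rules for STx_C given above can be collected into a small decision function. This is a sketch under stated assumptions, not a model of the hardware: the names are hypothetical, only the two block states that the text discusses are represented, and any other state raises an error rather than guessing at behavior.

```python
from enum import Enum
from typing import Optional


class BlockState(Enum):
    """Scache block states covered by the text (other states are not modeled)."""
    NOT_SHARED_DIRTY = "not-shared dirty"
    SHARED = "shared"


def stx_c_succeeds(noncached: bool,
                   lock_flag: bool,
                   state: Optional[BlockState] = None,
                   write_block_lock_cacked: bool = False) -> bool:
    """Return True if the STx_C passes, following the rules in this section."""
    if noncached:
        # Non-cached space: the lock flag is ignored and the SYSTEM's
        # response to WRITE BLOCK LOCK (CACK or CFAIL) decides the result.
        return write_block_lock_cacked
    if not lock_flag:
        return False  # lock flag clear: the STx_C fails
    if state is BlockState.NOT_SHARED_DIRTY:
        return True   # data is written directly into the Scache
    if state is BlockState.SHARED:
        return write_block_lock_cacked  # CACK passes, CFAIL fails
    raise ValueError("block state not covered by this sketch")


print(stx_c_succeeds(False, True, BlockState.SHARED, write_block_lock_cacked=True))
```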


4.1.9 Clocks 


4.1.9.1 CPU Clock 


External logic will supply DECchip 21164-AA with a differential clock at twice the desired internal 
clock frequency via the CLK_IN_H and CLK_IN_L pins. DECchip 21164-AA divides this clock 
by two to generate the internal chip clock. 


4.1.9.2 System Clock


The CPU clock is divided by a programmable value between 3 and 15 to generate a system clock,
which is supplied to the external interface via the SYS_CLK_OUT1_H,L pins. See Table 5-1 for the
valid ratios of System clock to CPU clock.


SYS_CLK_OUT1 is delayed by a programmable number of CPU cycles between 0 and 7 to produce
SYS_CLK_OUT2_H,L.


The output of the programmable divider is symmetric if the divisor is even, and asymmetric with 
SYS_CLK_OUT1_H and SYS_CLK_OUT2_H TRUE for one extra CPU cycle if the divisor is odd. 
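

The duty cycle rule above can be illustrated with a short sketch that splits one Sysclock period into its TRUE and FALSE portions, counted in CPU cycles. The function name is hypothetical; the rule itself (equal halves for an even divisor, one extra TRUE cycle for an odd divisor) is taken from the preceding paragraph.

```python
def sysclock_phases(divisor: int) -> tuple[int, int]:
    """Return (cpu_cycles_true, cpu_cycles_false) for one Sysclock period."""
    if not 3 <= divisor <= 15:
        raise ValueError("divisor must be between 3 and 15")
    false_cycles = divisor // 2
    true_cycles = divisor - false_cycles  # one extra TRUE cycle when odd
    return true_cycles, false_cycles


print(sysclock_phases(4))  # even divisor: symmetric
print(sysclock_phases(5))  # odd divisor: TRUE for one extra CPU cycle
```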


The false-to-true transition of the SYS_CLK_OUT1_H is the "Sysclock" used as a timing reference 
throughout the specification. 


4.1.9.3 Reference Clock 


The SYSTEM may supply a reference clock to which DECchip 21164-AA will synchronize
SYS_CLK_OUT1_H. To do this the frequency of SYS_CLK_OUT1 must be slightly higher than
that of REF_CLK_IN. This will cause the rising edge of SYS_CLK_OUT1 to drift back towards
the rising edge of REF_CLK_IN. DECchip 21164-AA will detect when the edges meet and stall the
internal clock generator for one CLK_IN cycle. This will move the rising edge of SYS_CLK_OUT1
back in front of REF_CLK_IN. Figure 4-12 shows this timing.


Figure 4-12: Reference Clock Timing

[Timing diagram: waveforms labeled CPU_IN, CLK, SYS_CLK_OUT1, and REF_CLK_IN, with the stall cycle marked.]


4.1.9.4 Sysclock to Bcache cycle time ratios


The Bcache cycle time may be faster, the same, or slower than the Sysclock.


Reads and writes that are private to DECchip 21164-AA and the Bcache may start on any CPU 
clock. There is no relation between the Sysclock and the Bcache accesses. 


If the SYSTEM is involved in a Bcache transaction, each read or write will start on a Sysclock. It
is up to the SYSTEM to control the rate of the Bcache transactions using the DACK wire. 


The Bcache will be written during WRITE BLOCK, WRITE BLOCK LOCK, READ DIRTY, and
READ DIRTY INV commands that source data from the Scache. The write of the first part of 
the block will start in the Sysclock that drove the command/response and address to the SYSTEM. 
The SYSTEM must allow enough time for the write to complete before asserting DACK. The next 
write will start on the Sysclock edge that DACK was asserted on. 


When DECchip 21164-AA receives the fill indication from the SYSTEM it will start writing the
Bcache in the N+2 Sysclock. At the end of the write time, DECchip 21164-AA will wait for the
next Sysclock edge. If DACK is not asserted, the Bcache write will begin again at the same
index. If DACK is asserted, the index will advance to the next part of the fill and the write will 
begin again. The SYSTEM must provide the data and DACK signal at the correct Sysclock edges 
to complete the fill correctly. 
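

The way DACK paces a fill can be sketched as a simple loop. The sketch assumes a block delivered in four 16-byte parts and represents DACK as a list of samples, one per Bcache write attempt; the function name, the `parts` parameter, and this sampling model are assumptions for illustration only.

```python
def fill_write_indices(dack_samples: list[bool], parts: int = 4) -> list[int]:
    """Index written on each successive Bcache write attempt during a fill.

    dack_samples[i] is the DACK value seen at the Sysclock edge that ends
    write attempt i. Without DACK the same index is written again; with
    DACK the index advances to the next part of the fill.
    """
    index = 0
    written = []
    for dack in dack_samples:
        if index >= parts:
            break  # every part of the block has been written
        written.append(index)
        if dack:
            index += 1
    return written


# DACK withheld on the first and fourth attempts: parts 0 and 2 are rewritten.
print(fill_write_indices([False, True, True, False, True, True]))
```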


4.1.10 Tri-state Overlap


The ADDRESS/CMD bus and the DATA/TAG bus must be operated in a way that prevents more 
than one driver from driving the bus at a time. This section will describe the features in DECchip 
21164-AA that might be used to prevent tri-state overlap. 


The owner of each bus must drive the bus to some value each cycle. 


In general DECchip 21164-AA assumes that its drivers turn on and off very fast (0.5 ns to 1 ns
range). SRAMs turn on and off slowly. System drivers fall somewhere in the middle.


Figure 4-13 shows DECchip 21164-AA and the SYSTEM taking turns driving the CMD/ADDRESS
bus. If ADDR_BUS_REQ is asserted at the end of Sysclock 0, the next cycle on the
CMD/ADDRESS bus belongs to the SYSTEM. DECchip 21164-AA will turn off its drivers at the
start of Sysclock 1. The SYSTEM must turn on its drivers during Sysclock 1, but must
ensure that its drivers do not turn on before DECchip 21164-AA turns off. DECchip 21164-AA will
sample the state of the CMD/ADDRESS bus at the end of Sysclock 1.


If ADDR_BUS_REQ remains asserted, the SYSTEM should continue to drive the CMD/ADDRESS
bus.


To pass the bus back to DECchip 21164-AA, the SYSTEM should turn off its drivers during a
Sysclock and de-assert ADDR_BUS_REQ. DECchip 21164-AA will not sample the state of the
bus if ADDR_BUS_REQ is de-asserted. At the next Sysclock edge, DECchip 21164-AA will drive
the bus.


Figure 4-13: Driving the CMD/ADDRESS Bus

[Timing diagram: SYS_CLK_OUT1 over Sysclock cycles 0 through 3, with a marker at the point where DECchip 21164-AA samples the bus.]


The DATA bus can be driven by DECchip 21164-AA, the Bcache, or the SYSTEM.


For DECchip 21164-AA Bcache Writes followed by DECchip 21164-AA Bcache Reads, we assume
that DECchip 21164-AA stops driving the DATA bus well in advance of the Bcache turning on.


For DECchip 21164-AA Bcache Reads followed by DECchip 21164-AA Bcache Writes, DECchip 
21164-AA will insert a programmable number of CPU cycles between the read and the write. This 
will allow time for the Bcache drivers to turn off before turning on the DECchip 21164-AA data 
drivers. These rules apply to WRITE BLOCK, WRITE BLOCK LOCK, READ, READ DIRTY, and 
FLUSH commands as well. 


DECchip 21164-AA will not prevent tri-state overlap at the start of a fill. The SYSTEM must assert 
IDLE_BC early enough to allow all the drivers to turn off before the SYSTEM turns on its drivers. 


At the end of the fill, DECchip 21164-AA will wait the programmable READ->WRITE number of CPU
cycles before starting a read or write. This time should allow the SYSTEM to turn off its drivers.
If this is not enough time, the SYSTEM may assert DATA_BUS_REQ to gain additional cycles.


4.1.11 Restrictions


This section will document restrictions on the use of DECchip 21164-AA interface features. 


4.1.11.1 Fills after other transactions 


If the system is removing data from DECchip 21164-AA with any of the system commands, or 
if the system is removing a Bcache victim from the Bcache and it wants to follow any of these 
transactions with a fill, then the earliest assertion of the FILL signal is the Sysclock after the 
last DACK. 


Fills followed by Fills is a special case. Fills can be pipelined back to back to use 100% of the 
data bus bandwidth. 


This restriction may be lifted in the future. 


4.1.11.2 Sending System commands 


A SYSTEM can send up to TWO commands to DECchip 21164-AA. It must then wait for the 
assertion of the RES_H signal for the first command before it can send the third command. 


4.1.11.3 CACK for WRITE BLOCK commands 


When DECchip 21164-AA requests a WRITE BLOCK or WRITE BLOCK LOCK, the SYSTEM can 
DACK the data before asserting CACK. The SYSTEM must assert CACK no later than the last 
DACK. 


4.1.11.4 No Bcache Systems 


SYSTEMs without a Bcache must have a block size of 64 bytes and all three sets in the Scache 
must be enabled. 


4.1.11.5 Scache duplicate tag store


SYSTEMs without a Bcache that do have an Scache duplicate tag store are also required to maintain 
tags for the two blocks in the DECchip 21164-AA Scache victim buffer. 
NOTE 


FETCH and FETCH_M commands will no longer be auto acked by DECchip 21164-AA. 
They will always be driven to the SYSTEM for acknowledgement. 


SET DIRTY, LOCK, and MB commands have been merged into a single command
group in the BC_CONTROL<EI_OPT_CMD> IPR.


4.1.12 ECC/Parity


The chip will support INT8 ECC for the external Bcache and memory system. ECC will be
provided by the CPU for each INT8 that is written into the Bcache. Fill data read from the
Bcache and memory will be checked by hardware. Uncorrected data will be sent to the Dcache
and register files. Single bit errors will be corrected by hardware. The Scache and Icache will be
filled with corrected data. Double bit errors will be detected. If the SYSTEM has indicated that
the data should not be checked, no checking or correcting will be performed.


Each data bus cycle will deliver one INT16 worth of data. ECC is calculated as ECC(data<63:0>) 
and ECC(data<127:64>). This allows ECC to be calculated on each side of the chip. Figure 4-14
shows the code. Two IDT49C460 or AMD29C660 parts can be cascaded to produce this ECC code. 
A single IDT49C466 will also support this ECC code. 


The code provides single bit correct, double bit detect, and all 1’s and all 0’s detect. 


If the DECchip 21164-AA is in parity mode, it will generate byte parity and place it on
DATA_CHECK_H<15:0> for writes. Parity will be checked for reads. Parity for data<7:0> will be driven
on DATA_CHECK_H<0> and so on.
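

The byte-to-check-bit mapping in parity mode can be sketched as follows. The text does not state the parity polarity, so the sketch takes it as a parameter and defaults to even parity purely as an assumption; the function name is also hypothetical. Only the mapping of data<8n+7:8n> to check bit n is taken from the text.

```python
def byte_parity_bits(data: int, even: bool = True) -> int:
    """Return a 16-bit value whose bit n is the parity bit for data<8n+7:8n>.

    Assumption: with even=True the parity bit makes the number of 1s in the
    byte plus its parity bit even. The actual polarity is not given here.
    """
    if not 0 <= data < (1 << 128):
        raise ValueError("data must fit in 128 bits")
    check = 0
    for n in range(16):
        byte = (data >> (8 * n)) & 0xFF
        bit = bin(byte).count("1") & 1  # 1 when the byte has an odd number of 1s
        if not even:
            bit ^= 1
        check |= bit << n
    return check


print(hex(byte_parity_bits(0x0103)))  # byte 0 = 0x03, byte 1 = 0x01
```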


Figure 4-14: ECC code

[Check-bit matrix not recoverable from the source. Columns are data bits 0 through 63 followed by check bits 0 through 7; each row, CB0 through CB7, marks the bits covered by that check bit.]

CB2 and CB3 are calculated for ODD parity (an odd number of "1"s counting the CB).

CB0, CB1, CB4, CB5, CB6, CB7 are calculated for EVEN parity (an even number of "1"s counting the CB).


For x4 RAMs, Dave Hartwell has provided the following bit arrangement that will detect nibble
errors.


Figure 4-15: x4 bit arrangement


CB0   CB1   CB5   CB6
CB2   D0    D4    D5
CB3   CB4   D7    D8
CB7   D2    D3    D11
D1    D6    D10   D13
D9    D14   D18   D21
D12   D16   D17   D22
D15   D19   D20   D23
D24   D25   D27   D30
D26   D28   D29   D31
D32   D34   D35   D37
D33   D36   D38   D40
D39   D41   D43   D46
D42   D44   D45   D47
D48   D50   D51   D53
D49   D52   D54   D56
D55   D57   D59   D62
D58   D60   D61   D63


4.2 Revision History


Table 4-9: Revision History

Who           When       Rev   Description of change

Pete Bannon   12/16/91   0.8   DRAFT 0.8 text
Pete Bannon   12/31/91   0.9   DRAFT 0.9 text
Pete Bannon   3/1/92     1.0   FILL ERROR, new non-cached read
Pete Bannon   3/27/92    1.2   New WS focus interface
Pete Bannon   3/27/92    1.3   New victim sequence
Pete Bannon   4/21/92    1.4   New ECC code
Pete Bannon   11/30/92   1.5   general update


Chapter 5


Reset and Initialization


5.1 SYS_RESET_L and DC_OK_H


The DECchip 21164-AA reset process starting from a powered off state uses two input signals, 
SYS_RESET_L and DC_OK_H. Until power has reached the proper operating point, DC_OK_H 
must be deasserted and SYS_RESET_L must be asserted. After power has reached the proper 
operating point, DC_OK_H is asserted. After that, SYS_RESET_L is deasserted. 


From a powered on state, the reset sequence begins with SYS_RESET_L assertion. In any case, 
after SYS_RESET_L is deasserted, DECchip 21164-AA begins a sequence of operations: Icache 
BiSt, followed by an optional automatic Icache initialization via an external serial ROM interface, 
and finally dispatching to the RESET PALcode trap entry point. 


If DC_OK_H is not asserted, SYS_RESET_L is forced asserted internally. 


SYS_RESET_L forces the CPU into a known state. Chapter 3 gives the reset state of each IPR 
and Section 9.1 gives the reset state of the pins. 


While DC_OK_H is deasserted, DECchip 21164-AA provides its own internal clock source from 
an on-chip ring oscillator. When DC_OK_H is asserted, the DECchip 21164-AA clock source is 
the differential clock input pins, CLK_IN_H and CLK_IN_L. 


SYS_RESET_L must remain asserted while DC_OK_H is deasserted and for a period of time 
after DC_OK_H assertion which is at least TBD internal CPU cycles in length and at least TBD 
Sysclock cycles in length. After that, SYS_RESET_L is deasserted. SYS_RESET_L deassertion 
generally should be synchronous with respect to Sysclock.


ISSUE 
Does DECchip 21164-AA have to support asynchronous deassertion of SYS_RESET_L? 


When DECchip 21164-AA is running off the internal ring oscillator, the internal clock frequency 
is in the range TBD. Also the Sysclock divisor ratio is forced to TBD and the SYS_CLK_OUT2_x 
delay is forced to TBD. After DC_OK_H is asserted, the Sysclock divisor and SYS_CLK_OUT2_x 
delay are determined by input pins while SYS_RESET_L remains asserted. See Section 5.2. 


5.1.1 Power Up Requirements


The DECchip 21164-AA chip uses a 3.3V power supply. This 3.3V power supply must be stable
before any input or bidirectional pin rises above 4V.


The VREF_H input pin must have reached the correct stable operating point before DC_OK_H 
is asserted. See Chapter 7. 


5.1.2 Pin State with DC_OK_H Not Asserted 


While DC_OK_H is not asserted (and SYS_RESET_L is asserted), every output and bidirectional 
DECchip 21164-AA pin is tristated and pulled weakly to ground by a small pull-down transistor. 


5.2 Sysclock Ratio and Delay 


While in reset, DECchip 21164-AA reads Sysclock configuration parameters from the interrupt
pins. Table 5-1 shows how the Sysclock divisor is determined and Table 5-2 shows how the
SYS_CLK_OUT2_x delay is determined. These inputs should be driven with the correct
configuration whenever SYS_RESET_L is asserted. When these inputs change while SYS_RESET_L is
asserted, it takes TBD internal CPU cycles before the new Sysclock behavior is correct.


Table 5-1: System Clock Divisor

IRQ_H<3>   IRQ_H<2>   IRQ_H<1>   IRQ_H<0>   Ratio

L          L          H          H          3
L          H          L          L          4
L          H          L          H          5
L          H          H          L          6
L          H          H          H          7
H          L          L          L          8
H          L          L          H          9
H          L          H          L          10
H          H          H          H          15

all other values                            unspecified effect
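

A lookup of Table 5-1 can be sketched as below. The dictionary holds exactly the rows listed in the table; the string encoding of the pin levels (IRQ_H<3> first, IRQ_H<0> last) and the function name are choices made for this example. Combinations that the table leaves unspecified are rejected rather than guessed.

```python
# Rows of Table 5-1, keyed by the levels of IRQ_H<3>, <2>, <1>, <0> in that order.
SYSCLOCK_RATIOS = {
    "LLHH": 3,
    "LHLL": 4,
    "LHLH": 5,
    "LHHL": 6,
    "LHHH": 7,
    "HLLL": 8,
    "HLLH": 9,
    "HLHL": 10,
    "HHHH": 15,
}


def sysclock_ratio(irq_h: str) -> int:
    """Return the Sysclock divisor for the given IRQ_H<3:0> levels."""
    try:
        return SYSCLOCK_RATIOS[irq_h]
    except KeyError:
        raise ValueError(f"IRQ_H<3:0> = {irq_h!r} has an unspecified effect") from None


print(sysclock_ratio("LHLH"))
```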


Table 5-2: System Clock Delay

[Table body not recoverable from the source. Columns: SYS_MCH_CHK_IRQ_H, PWR_FAIL_IRQ_H, MCH_HLT_IRQ_H, Delay.]


5.3 BiSt 


Normally upon deassertion of SYS_RESET_L, DECchip 21164-AA automatically executes Icache 
BiSt (Built in Self-test). If PORT_MODE_H<1> is asserted, the test port is in debug test interface 
mode and BiSt is bypassed. Otherwise, the Icache is automatically tested and the result is made 
available in ICSR and on TEST_STATUS_H<0>. Internally, the CPU chip reset continues to be 
asserted throughout the BiSt test process. 


5.4 Serial ROM 


After Icache BiSt completes, an optional serial ROM Icache load sequence begins. If
SROM_PRESENT_L was not asserted when SYS_RESET_L transitioned to deasserted, the serial ROM
load process is skipped, internal CPU reset is deasserted, and PALcode execution begins at the
RESET trap entry point. If SROM_PRESENT_L was asserted when SYS_RESET_L transitioned
to deasserted, the serial ROM load sequence is completed prior to deassertion of internal CPU
reset and PALcode execution at the RESET trap entry point.


Figure 5-1 gives a timing diagram of a serial ROM load sequence. Chapter 11 describes the
format of the Icache data. Every data and tag bit in the Icache is loaded by this sequence. 
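

The order of events after SYS_RESET_L is deasserted, as described in Sections 5.3 and 5.4, can be summarized in a small sketch. The function and parameter names are hypothetical, and the step labels are descriptive strings rather than signal names.

```python
def reset_sequence(port_mode_h1_asserted: bool, srom_present_asserted: bool) -> list[str]:
    """List the steps that follow SYS_RESET_L deassertion.

    port_mode_h1_asserted: PORT_MODE_H<1> asserted (debug test interface mode).
    srom_present_asserted: SROM_PRESENT_L was asserted when SYS_RESET_L deasserted.
    """
    steps = []
    if not port_mode_h1_asserted:
        steps.append("Icache BiSt")  # bypassed in debug test interface mode
    if srom_present_asserted:
        steps.append("serial ROM Icache load")
    steps.append("deassert internal CPU reset")
    steps.append("dispatch to RESET PALcode trap entry point")
    return steps


for step in reset_sequence(port_mode_h1_asserted=False, srom_present_asserted=True):
    print(step)
```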


Figure 5-1: Serial ROM Load Timing

[Timing diagram: SYS_RESET_L, SROM_OE_L, and SROM_CLK_H waveforms, with the points at which SROM_DAT_H is sampled.]


5.5 Cache Initialization


Regardless of whether Icache BiSt is executed, the Icache is flushed during the reset sequence 
prior to serial ROM load. If serial ROM load is bypassed, the Icache is initially in the flushed 
state. 


The Scache is flushed and enabled by internal reset. This is required if serial ROM load is
bypassed. The initial Istream reference after reset is location 0. Since that is a cacheable-space
reference, it will probe the Scache.


The Bcache is disabled by reset.
The Dcache is disabled by reset. It is not initialized or flushed by reset.


5.6 BIU initialization 


After reset, the Cbox is in the default configuration dictated by the reset state of the IPR bits
which select the configuration options. (Note that the Bcache configuration registers are not
initialized by reset.) The Cbox response to system commands and internally generated memory
accesses will be determined by this default configuration. Systems should be compatible with this
default configuration or arrange to change it before initiating any accesses to cacheable space.


Since the initial PALcode trap entry point is in cacheable space, system environments which are
not compatible with the default configuration must utilize the serial ROM Icache load feature to
initially load and execute a PALcode program which will configure Cbox IPRs as needed.


5.7 Uninitialized State


A number of IPR bits are not initialized by reset. These are error reporting registers and some 
other IPR states. These must be initialized by initialization PALcode. 


5.8 Timeout Reset 


The Ibox contains a timeout timer which times out when a very long period of time passes with 
not one instruction completing. When this timeout occurs, an internal reset event occurs which 
clears sufficient internal state to allow the CPU to begin executing again. Registers, IPRs, and
Caches are not affected. Dispatch to the PALcode MCHK trap entry point occurs immediately. 


5.9 Clock Reset 


A TBD method will exist which will allow a chip tester to initialize the Sysclock divider logic. This
allows for deterministic operation during chip test. Due to the size of internal logic propagation 
delays as compared to the normal speed of the internal CPU clock, it will be necessary to run the 
internal CPU clock at a low speed while initializing the Sysclock divider. 
5.10 IEEE 1149.1 Test Port Reset


TRST_L must be asserted whenever SYS_RESET_L is asserted or DC_OK_H is deasserted. 


Continuous TRST_L assertion during normal operation can be used to prevent the IEEE 1149.1 
Test Port from affecting DECchip 21164-AA operation. 
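

The TRST_L rule above can be written as a simple predicate. The following is a minimal sketch for use in a board-level checker or simulation monitor; the function names and the boolean "asserted" encoding are assumptions made for illustration and are not part of the chip interface.

```python
def trst_l_required(sys_reset_l_asserted: bool, dc_ok_h_asserted: bool) -> bool:
    """Return True when the rule requires TRST_L to be asserted.

    TRST_L must be asserted whenever SYS_RESET_L is asserted
    or DC_OK_H is deasserted.
    """
    return sys_reset_l_asserted or not dc_ok_h_asserted


def trst_l_state_is_legal(trst_l_asserted: bool,
                          sys_reset_l_asserted: bool,
                          dc_ok_h_asserted: bool) -> bool:
    """Check one sampled pin state against the rule.

    Asserting TRST_L when it is not required is still legal, since
    continuous assertion is allowed during normal operation.
    """
    if trst_l_required(sys_reset_l_asserted, dc_ok_h_asserted):
        return trst_l_asserted
    return True
```

A monitor would call `trst_l_state_is_legal` on every sampled combination of the three pins and flag any sample for which it returns False.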
5.11 Revision History


Table 5-3: Revision History


Who   When               Description of change
JHE   1-March-1992       Brief statement of plan.
JHE   30-November-1992   Update
Chapter 6


Error Handling 


6.1 Overview 











This is an overview of DECchip 21164-AA’s error handling strategy. Each internal cache (Icache,
Dcache, and Scache) implements parity protection for tag and data. ECC protection is implemented
for memory and Bcache data. (The implementation provides detection of all double-bit [...]
Istream and Dstream ECC errors are corrected in hardware without PALcode inte[...] cache tags
are parity protected. The Ibox implements logic which detects wh[...] been made for a very long
time (a TBD number of CPU cycles of issue sta[...]tely repeated traps) and forces a machine check
trap. PALcode handles all error traps [...]d correctable error interrupts). At the time the error is
handled, PALcode bu[...]me pointed to by the HWRPB.


Where possible, the addres ee to the operating oe Most of the 
Istream errors are retryabk 


memory location. 


6.2 Error Flows 


6.2.1 


dation: Flush the Icache early in the MCHK routine. 
6.2.2


6.2.3 





















Scache data parity error - Istream 


Machine check occurs before the instruction causing the pari 
Bad data may be written to the Icache or Icache Refill Bu : 
Retryable if there are no multiple errors. 

Recommendation: Flush the Icache to remove bad data. 


tions). Then flush the Icache again. 
SC_STAT: SC_DPERR<7:0> set , SC_SCND_ERR s 
SC_STAT: CBOX_CMD is IRD 


SC_ADDR: Contains the address of the 32B 
which octaword was accessed first, but the e 


data, another parity error may rési uri Writeback (this is a reason not to attempt 
this in palcode, since a MCHE fe 


Machine check occurs b i Hon causing the parity error is executed. 
Bad data may be writtéa® r Icache Refill Buffer and validated. 


Not retryable. Probal o recover by deleting a single process because the 





e to remove bad data. The Icache Refill Buffer may be 


dress of the 32B block containing the error. (Note: bit4 indicates 
ssed first, but the error may be in either octaword). 


ty error occurs early in the PALcode routine at the machine check 
nite loop may result. 




6.2.4 


DEC Chip 21164-AA (EV5 CPU) Specification, Revision x, October 1993 

















Scache data parity error - Dstream read/write, REA 


Machine check occurs. Machine state may have changed. 


Not retryable, but may only need to delete the process if 
and no second error occurred. 


SC_STAT: SC DEERE <C> set , SC_SCND_ERR set if th 


SC “ADDR: Contains the sae of the 328 block c 
which octaword was accessed first, but the error m 


Not retryable. Probably won't be able to 
exact address is unknown. 


SC_STAT: SC_TPERR<7:0> set , SC_SCND_ERR ere are multiple errors 
SC_STAT: CBOX_CMD is DRD, DWRITE, READ_DIRTY, SET_SHARED, or INVAL 


tion with error. 


g a single process because the 


and no second error occ 


DCPERR_STAT: DPO 
multiple parity error: 
bit will be set. 


VA: Contains the 


known, and a load may have falsely hit. 
6.2.8 Istream uncorrectable ECC or data parity errors (Bc


¢ Machine check occurs before the instruction causing the err 
¢ Bad data may be written to the Icache or Icache Refill Bu 
¢ Retryable if there are no multiple errors. 


¢ Must flush Icache to remove bad data. The Icache Refill Bik 
enough instructions to fill the refill buffer with new data (32 ir ins 
Icache again. 

¢ EILSTAT: UNC_ECC_ERR set, SEO_HRD_ERR set* 

¢ EJ_STAT: ELES set if source of fill data is memory/syst 

* EISTAT: FIL_IRD is set 

¢ EJADDR: contains the physical address bit: 

e FILL_SYN: contains syndrome bits associ 
tains byte parity error status if in parity m 

¢ BC_TAG_ADDR: holds results of external caché 
this transaction. 

¢ Note: Ifthe Istream ECC or panty ¢ . 


“flush” the block of data out of 
index, but a different tag. If t 
data" has been replaced. If, “marked dirty, then when the new data tries 
to replace the old data, anoth ror may result during the writeback (this is a 
reason not to attempt th yalec a MCHK from palcode is always fatal). 


6.2.9 





"error status if in parity mode. 
ds results of external cache tag probe if external cache was enabled for 
6.2.10 Bcache tag parity errors - Istream


* Machine check occurs before the instruction causing the erro 
¢ Bad data may be written to the Icache or Icache Refill Buffe 
¢ Retryable if there are no multiple errors. 


¢ Must flush Icache to remove bad data. The Icache Refill B 
enough instructions to fill the refill buffer with new data (32 in ms). Then flush the 
Icache again. 


¢ EISTAT: BC_TPERR or BC_TC_PERR set, SEO_ Hi 
e EI_STAT: EI_ES clear 

e EISTAT: FIL_IRD is set 
© EI_ADDR: contains the physical address bits 
* BC_TAG ADDR: holds results of external cé 
t the parity bit. The victim is 
© the control field parity. PALcode 


can distinguish fatal from non-fatal occurrences by ck g for the case in which a potentially 


dirty block is replaced without the victim being pre : 








and no second error occur 
bit. The victim is proce 







the status bits in the tag, ignoring the control field 
#:feom non-fatal occurrences by checking for the case in 
replated without the victim being properly written back and 
¢ When DECchip 21164-AA detects a command or address parity
ditionally NOACKed. 


1s uncon- 


6.2.13 System reads of the Bcache 


• DECchip 21164-AA does not check the ECC on outgoing Bcache [...] bad, the receiving
processor will detect it.


6.2.14 Istream or Dstream correctable ECC error te memory) 


• DECchip 21164-AA hardware corrects the data before filling the Scache and Icache. The
Dcache is completely invalidated. The data in the Bcache contains the ECC error, but is
scrubbed by palcode in the correctable error interrupt routine. (Using LDxL, STxC. If the
STxC fails, the location can be assumed to be scrubbed.)


¢ A separately maskable correctable error inté: 
(Masked by clearing ICSR<crde>.) 


¢ ISR: CRD set. 
¢ EI_STAT: COR_ECC_ERR set. 










¢ EI_STAT: ELES clear if source 
EI_ADDR: cohiains the Dryer ; 


‘set otherwise. 
*4 of the octaword associated with the error. 
with the octaword containing the ECC error. 









srence to DECchip 21164-AA. If the system environment expects 
d detect them. If it does not expect them (as might be true in 
ory access timing), it is likely that the internal Ibox timeout 
#41] if a fill fails to occur. To properly terminate a fill in an error 
OR | A: pin is asserted for one cycle and the normal fill sequence involving 
and DACK pins is generated by the system environment. 


tatus is saved to show that this happened. If necessary, systems must 
tus, and include reads of the appropriate status register(s) in the MCHK palcode. 
DECchip 21164-AA has a maskable machine check interrupt,
environments to signal fatal errors which are not direct] 
DECchip 21164-AA. It is masked at IPL 31 and anytime 


ISR: MCK set. 


ed by system 
d access from 


ibox timeout 
When the Ibox detects a timeout, it causes a PALco ACHK entry point. 
Simultaneously, a partial internal reset occur sxcept IPR state is reset. This 
should not be depended on by systems in whi ccur in typical use (e.g., oper- 
ating system or console code probing locati srmine if certain hardware is present). 
The purpose of this error detection mechanigii is tegttempitto prevent system hang in order 


to write a machine check stack frame. 
ICPERR_STAT: TMR set. 


Assertion of CFAIL_H in a sysclaé 
21164-AA to immediately execu 


PALcode trap to the MCHK exit 
Simultaneously, a partial int 
ICPERR_STAT: TMR set. 


This can be used to restqj 
state after the externak 


ACK_H is not asserted causes DECchip 


p-AA and the external environment to a consistent 
tects a command or address parity error. 


: to differentiate the CFAIL_H/no CACK_H case from 
ary, systems must save this status, and include reads of the 
e MCHK palcode. 


; : bad data on Istream errors. The Icache Refill! Buffer may be 
enough instructions to fill the refill buffer with new data (32 instruc- 


chk> set, THEN HALT (reason_for_halt = dbl_mchk) (???need to unlock regs???). 
s<mchk>. 

© clear out Mbox/Chox before reading Chox registers or issuing DC_LFLUSH. 
Flush Deache to remove bad data on Dstream errors. 
• Read ICSR.
• Read ICPERR_STAT.
• Read DCPERR_STAT.
• Read SC_ADDR.
• Use register dependencies or MB to ensure read of SC_ADDR finishes before subsequent read
  of SC_STAT.
• Read SC_STAT (unlocks sc_addr).
• Read EI_ADDR, BC_TAG_ADDR, FILL_SYN.
• Use register dependencies or MB to ensure reads of EI_ADDR, BC_TAG_ADDR, and FILL_SYN
  finish before subsequent read of EI_STAT.
• Read EI_STAT and save (unlocks EI_ADDR, BC_TAG_ADDR [...]
• Read EI_STAT again to be sure it is unlocked.
• Check for non-retryable cases. If any one of the following is true, then skip retry:
  • EI_STAT<tperr>
  • EI_STAT<tc_perr>
  • EI_STAT<ei_par_err>
  • EI_STAT<seo_hrd_err>
  • EI_STAT<unc_ecc_err> AND NOT [...]
  • DCPERR_STAT<lock>
  • SC_STAT<sc_scnd_err>
  • SC_STAT<sc_tperr>
  • NOT (SC_STAT<cmd> [...] SC_STAT<sc_dperr>
  • ICPERR_STAT<tmr>
  • ISR<mck>

« If none of the above: en éither we have a retryable iread, or the source of the 
MCHK is a FILL_ERROR: dd code for query of system status 

¢ Set the retry fla if any one or several of the following are true ( and none 
of the above con 


SD (SC_STAT<cmd> == IRD) 


e, including the following IPRs: 


e ICSR 
¢ ICPERR_STAT 
¢ DCPERR_STAT
¢ MM_STAT
¢ VA (read unlocks VA and MM_STAT) 
¢ SC_ADDR from register file 
¢ SC_STAT from register file 
¢ BC_TAG_ADDR from register file 
¢ EI_ADDR from register file 
¢ FILL_SYN from register file 
¢ EI_STAT from register file 
¢ LD_LOCK 
¢ unlock the following iprs: 
¢ ICPERR_STAT (write 0x1800) 
¢ DCPERR_STAT (write 0x03) 
¢ VA, SC_STAT, and EI_STAT are already unlocked.
¢ Check for arithmetic exceptions: 
¢ Read EXC_SUM. 
¢ Check for arithmetic errors 


¢ If arithmetic error found, gi routine (which builds stack frame and 


returns back here) 
¢ Clear EXC_SUM (unlock 


¢ Report the Processor Uncorrectable MCHK according to Operating System specific requirements.
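

The "skip retry" test in the MCHK flow above amounts to OR-ing a set of status bits. The following is a minimal sketch of that check, assuming the saved register images have already been decoded into nested dictionaries of booleans; the dictionary layout and the function name are hypothetical. Only the conditions that are fully legible in the list are included. The two compound conditions whose text is incomplete (the one starting with EI_STAT<unc_ecc_err> and the one involving SC_STAT<cmd> and SC_STAT<sc_dperr>) are deliberately left out rather than guessed.

```python
# (register, field) pairs taken from the non-retryable list in the MCHK flow.
# Field spellings follow the list as far as it can be read; this is an
# illustrative model, not PALcode.
NON_RETRYABLE_FLAGS = (
    ("EI_STAT", "tperr"),
    ("EI_STAT", "tc_perr"),
    ("EI_STAT", "ei_par_err"),
    ("EI_STAT", "seo_hrd_err"),
    ("DCPERR_STAT", "lock"),
    ("SC_STAT", "sc_scnd_err"),
    ("SC_STAT", "sc_tperr"),
    ("ICPERR_STAT", "tmr"),
    ("ISR", "mck"),
)


def skip_retry(status: dict) -> bool:
    """Return True if any listed non-retryable flag is set.

    `status` maps a register name to a dict of field name -> bool,
    e.g. {"EI_STAT": {"seo_hrd_err": True}}. Missing entries count as clear.
    """
    return any(status.get(reg, {}).get(field, False)
               for reg, field in NON_RETRYABLE_FLAGS)
```

For example, `skip_retry({"ICPERR_STAT": {"tmr": True}})` reports that the retry flag must not be set, which matches the Ibox timeout case.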


¢ Read MC it MCE se> is set, set SECOND ERROR FLAG in the logout frame, skip 
#a.to scrub memory location routine. 


Tare 
code to be processor correctable error and write. 


_ADDR, FILL_SYN, EI_STAT (BC_TAG_ADDR is unpredictable, so don’t save 


• Scrub the memory location by using LDQ_L/STQ_C to one of the quadwords in each octaword
of the Bcache block whose address is reported in EI_ADDR. No need to scrub IO space
addresses as these are non-cacheable.
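

The scrub step touches one quadword in every octaword (16 bytes) of the affected Bcache block. The following is a minimal sketch of the address arithmetic only, assuming the Bcache block size is supplied by the caller as a power of two; the function name and the `block_bytes` parameter are hypothetical, and the actual LDQ_L/STQ_C sequence is PALcode and is not modeled here.

```python
OCTAWORD_BYTES = 16  # one octaword = two quadwords = 16 bytes


def scrub_addresses(ei_addr: int, block_bytes: int) -> list:
    """List one quadword-aligned address per octaword of the block.

    ei_addr:     physical address reported in EI_ADDR
    block_bytes: Bcache block size in bytes (assumed power of two, >= 16)
    """
    if block_bytes < OCTAWORD_BYTES or block_bytes & (block_bytes - 1):
        raise ValueError("block_bytes must be a power of two >= 16")
    base = ei_addr & ~(block_bytes - 1)  # align down to the block start
    # The first quadword of each octaword is chosen; any one quadword
    # per octaword satisfies the rule in the text.
    return [base + offset for offset in range(0, block_bytes, OCTAWORD_BYTES)]
```

With a 64-byte block and a reported address of 0x1234, the sketch yields 0x1200, 0x1210, 0x1220, and 0x1230.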
• ACK the CRD Interrupt by writing a "0" to HWINT_CLR<crdc>.
• If MCES<dpc> is set (logging disabled), dismiss the interrupt.
• Set MCES<pce>.


¢ No need to unlock any registers because conditions that waxlt would also cause 
a MCHK. VA will not be locked because DTB_MISS and F? 
interrupted. 


ments. 
¢ NOTE: Only read EI_STAT once in the CRD flow, 


status. 


6.5 MCK_INTERRUPT Flow 


• Got here through interrupt routine because ISR<MCK> bit set.
• Read MCES.
• If MCES<mchk> set, THEN HALT (reason_for_halt = dbl_mchk) (???need to unlock regs???)








¢ Set MCES<mchk> : 
* Follow MCHK frame buildin ae ‘set retry flag. 
• Report the System Uncorrectable MCHK according to Operating System specific requirements.
6.7 Revision History


Table 6-1: Revision History 
Who When 


1-March-1992 








JEM 13-Nov-1992 S 
JHE 19-Dec-1992 Edits for new relruse. 
JEM 6-May-1993 Added more detailed: 

2-Sep-1993 , 
Chapter 7


DC Characteristics 


7.1 Overview 


DECchip 21164-AA is capable of running in a CMOS/TTL environment or an ECL environment. 
The chips will be tested and characterized in a CMOS environment. The specifications below 
assume a CMOS/TTL environment. Differences for an ECL environment are noted in Section 7.2. 


7.1.1. Power Supply 


In CMOS mode the VSS pins are connected to 0.0V, and the VDD pins are connected to 3.3V, +/- 
5%. 


The VREF_H analog input should be connected to a 1.4V +/-10% reference supply. 
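

As a worked example of the tolerances above, the percentage limits translate into absolute voltage windows. The following is a minimal sketch; the helper name is hypothetical and the values are simply the nominal figures and tolerances quoted in this section.

```python
def tolerance_window(nominal_volts: float, percent: float) -> tuple:
    """Return (min, max) for a nominal voltage with a +/- percent tolerance."""
    delta = nominal_volts * percent / 100.0
    return nominal_volts - delta, nominal_volts + delta


# VDD in CMOS mode: 3.3 V +/- 5%  -> roughly 3.135 V to 3.465 V
vdd_min, vdd_max = tolerance_window(3.3, 5.0)

# VREF_H reference: 1.4 V +/- 10% -> roughly 1.26 V to 1.54 V
vref_min, vref_max = tolerance_window(1.4, 10.0)
```

A supply monitor could compare measured rail voltages against these windows.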


7.1.2 Input Clocks 


CLK_IN (H,_L) is expected to be a differential signal generated from an ECL oscillator circuit,
although non-ECL circuits may also be used. It may be AC coupled, with a nominal DC bias of
VDD/2 set by a high-impedance (i.e. >1K) resistive network on chip. It need not be AC coupled
if VDD is used as the VCC supply to the ECL oscillator. See the AC Characteristics chapter for
more detail.




DEC Chip 21164-AA (EV5 CPU) Specification, Revision 1.9, December 1992 


7.1.3 Signal pins


Input pins are ordinary CMOS inputs with standard TTL levels, see Table 7-1. Once power has 
been applied and VREF_H has met its hold time, the majority of input pins can be driven by 5.0V 
(nominal) signals without harming DECchip 21164-AA. There are some signals that are sampled 
before VREF_H is stable, and these signals can not be driven above the power supply. These 
signals are: 

• DC_OK_H
• ECL_OUT_H
• TRST_L
• TDI_H
• TDO_H
• TMS_H
• TCK_H


Output pins are ordinary 3.3V CMOS outputs. Although output signals are rail-to-rail, timing is 
specified to standard TTL levels, see Table 7-1. 


Bidirectional pins are ordinary 3.3V CMOS bidirectional. On input, they act like input pins. On 
output, they drive like output pins. 


Once power has been applied, input (except noted above) and bidirectional pins can be driven to 
a maximum DC voltage of 5.5V without harming DECchip 21164-AA (it is not necessary to use 
static RAMS with 3.3V outputs). 


Table 7-1: CMOS DC Characteristics 


Parameter                                Requirements
Symbol  Description                      Min    Max    Units  Test Conditions
TTL Inputs/Outputs
Vih     High level input voltage         2.0           V
Vil     Low level input voltage                 0.8    V
Voh     High level output voltage        2.4           V      Ioh = -100uA
Vol     Low level output voltage                0.4    V      Iol = 3.2mA
Power/Leakage
Icin    Clock input leakage              -50    50     uA     -0.5<Vin<5.5V
Iil     Input leakage current            -10    10     uA     0<Vin<Vdd V
Iol     Output leakage current           -10    10     uA
        (three-state)
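

The Vih and Vil rows of Table 7-1 define which input voltages are guaranteed to be read as a logic level. The following is a minimal sketch of that classification, assuming the usual reading of the table (2.0 V is the minimum guaranteed high level and 0.8 V the maximum guaranteed low level); the function name is hypothetical.

```python
VIH_MIN_VOLTS = 2.0  # minimum high level input voltage (Table 7-1)
VIL_MAX_VOLTS = 0.8  # maximum low level input voltage (Table 7-1)


def classify_input(volts: float) -> str:
    """Classify a DC input voltage against the TTL input thresholds."""
    if volts >= VIH_MIN_VOLTS:
        return "high"
    if volts <= VIL_MAX_VOLTS:
        return "low"
    # Between the thresholds the logic level is not guaranteed.
    return "undefined"
```

Voltages between 0.8 V and 2.0 V fall in the region where the table gives no guarantee, which is why the sketch reports them separately.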
7.2 ECL 100K Mode


In ECL 100K mode a combination of on-chip and off-chip circuits provide ECL 100K compatible 
interfaces. 


7.2.1 Power Supply 


In ECL 100K mode the VDD pins are connected to 0.0V, and the VSS pins are connected to -3.3V,
+/- 5%.


7.2.2 Reference Supply 


In ECL 100K mode the VREF_H input is connected to a reference supply at VDD-1.3V. The best
way to generate the reference supply is to use the VBB output provided by several chips, such as
the ECLinPS MC100E111.


7.2.3 Inputs 


In ECL 100K mode inputs appear to be ordinary ECL 100K inputs, with the exception that they 
lack the pull down resistor that is normally present in ECL 100K circuits. 


7.2.4 Outputs 


In ECL 100K mode external resistors create the correct ECL 100K levels. The following stylized 
circuit is used. 


[Circuit diagram not reproduced: the CPU output connects to the ECL 100K input through an
external resistor network (RL); the labeled values are 50 ohms and 100 ohms.]


7.2.5 Bidirectionals 


In ECL 100K mode the bidirectional pins should be converted into unidirectional input and output 
busses as close to DECchip 21164-AA as possible. The DECchip 21164-AA chip bidirectional bus 
is buffered and driven onto the system output bus. The system input bus is driven onto DECchip 
21164-AA’s bidirectional bus using cut-off drivers controlled by the CPU’s output enables. 


The same resistor network used on output pins is used on bidirectional pins. 
7.3 Power Dissipation


Table 7-2 shows the estimated maximum power consumption at 286 MHz. Power consumption
scales linearly with frequency in the frequency range 225 MHz to 312 MHz.


Table 7-2: DECchip 21164-AA Estimated Power Dissipation @Vdd=3.45V 
Speed Min Typ Max Units 
286 Mhz TBD TBD 60 Watts 
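

As a worked example of the scaling rule, the 286 MHz figure can be projected to other frequencies in the stated range. The following is a minimal sketch which assumes that "scales linearly" means power is directly proportional to frequency; that interpretation, and the function name, are assumptions rather than statements from the specification.

```python
REFERENCE_MHZ = 286.0         # frequency of the Table 7-2 entry
REFERENCE_MAX_WATTS = 60.0    # estimated maximum power at that frequency
MIN_MHZ, MAX_MHZ = 225.0, 312.0  # range over which scaling is stated to hold


def estimated_max_power_watts(freq_mhz: float) -> float:
    """Project the estimated maximum power to another frequency.

    Assumes power is proportional to frequency inside the stated range.
    """
    if not MIN_MHZ <= freq_mhz <= MAX_MHZ:
        raise ValueError("scaling is only stated for 225 MHz to 312 MHz")
    return REFERENCE_MAX_WATTS * freq_mhz / REFERENCE_MHZ
```

Under this assumption the estimate is about 47.2 W at 225 MHz and about 65.5 W at 312 MHz.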
7.4 Revision History


Table 7-3: Revision History


Who           When                 Description of change
Pete Bannon   December 16, 1991    Include EV4 text
JHE           December 16, 1992    Updates
Chapter 8


AC Characteristics 


8.1 Overview 










or TTL environment. Timing parameters are given fo¥ 
operating at an internal frequency of 294 MHz (3.4 ns). 


8.2 Clocking Scheme 


The input clock pins OSC_CLK_IN run at 2x the internal frequency of the time base for
DECchip 21164-AA. Input clock i 


for internal distribution. 


System designers have a choice of two clocking schemes to run DECchip 21164-AA synchronous
to the system.


1. DECchip 21164-AA 


2. DECchip 21164-AA will


system clock. 
8.2.1 Input Clocks
The input clocks OSC_CLK_IN_H,_L provide the time-base for DECchip 21164-AA when DC_OK_H
is asserted. The terminations on these signals are designed to be compatible with system oscillators
of arbitrary DC bias. The schematic equivalent is shown below:


Figure 8-1: OSC _CLK_IN_H, L Terminations 








[Schematic not reproduced. Labels in the figure: ZO = 30 OHMS, TD = 200 pS, Package Pin,
Chip Pad, to diff-amp, 50 OHMS, High Resistance.]






















impedance bias drive 
21164-AA. The peak- 
seen by DECchip 2136 i “square-wave” or a sinusoidal source may be used. Note 


Table 8—1: 
Nominal Bin 
ns 
17.0 ns 


50 +/- 10 percent 
V (peak to peak) 


OSC_CLK* 










8.3 Pin Characteristics 


All DECchip 21164-AA input pins are TTL compatible with the exception of [...]
and TEMP_SENSE. All output pins are TTL compatible.


8.4 Back-up Cache Loop Timing 


DECchip 21164-AA initiated private Bcache read or write (S 
pendent of the system clocking scheme. Bcache loop timing
DECchip 21164-AA cycle time. Outgoing Bcache ind
edge and the in-coming Bcache tag and data pins a


Table 8-2: Output Driver Characteristics : 
Spec 40pF Load 10pF Load Name 




















Maximum Driver Delay 
Minimum Driver Delay 


Table 8-3: Bcache Loop Timing 
Pin 


DATA<127:0>


DATA<127:0> input hold Tdh

INDEX<25:4> output delay Tdd + 0.4nS Tiod 1
INDEX<25:4> Tmdd Tioh

DATA<127:0> [...] Tdd + Tcycle + 0.4nS Tdod 1
DATA<127:0> [...] Tmdd + Tcycle Tdoh


“driver mismatch and clock skew 
Figure 8-2: BCache Timing


[Timing diagram not reproduced: BCache Loop (READ), showing CPUCLK, INDEX Out, and DATA Out
across a BCache cycle.]
8.4.1 SYS_CLK based systems


Systems which use the SYS_CLK_OUT1_H,L outputs of DECchip 2 
will have the following timing. Note that all timing is with respect, 
CPU CLK, this allows the setup and hold times to be specified indég 
loading of SYS_CLK_OUT1, ADDR, DATA and command pins. REE: 


% 


for proper operation. , 


elative capacitive 
st be tied to VDD 


Table 8-4: Systems Using SYS CLK 





SYS_CLK_OUT1 _ output delay 


SYS_CLK_OUT1 Min. output delay Tsysdm 
DATA_BUS_REQ, input setup Tdsu 

DATA<127:0>, ADDR<39:4>

DATA_BUS_REQ, input hold Tdh 

DATA<127:0>, ADDR<39:4> 

ADDR<39:4> output delay Taod 1 
ADDR<39:4> output hold time Taoh 

DATA<127:0> output delay + Tcycle + 0.4nS Tdod 1,2
DATA<127:0> [...] + Tcycle Tdoh 1,2
Non-Turbo Mode 

ADDR_BUS_REQ 3.8nS Tabrsu 
ADDR_BUS_REQ -1.0nS Tabrh 

CACK, DACK 3.4nS Tntacksu 

CACK, DACK -1.0nS Tntackh 

Turbo Mode 

ADDR_BUS_ 1.1nS Ttacksu 3 


CACK, DACK: 
Ttackh 


ccounts for on chip driver mismatch and clock skew. 
ite transactions initiated by DECchip 21164-AA, data is driven 1 CPU cycle later. 
ade, control pins are piped on-chip for one SYS_CLOCK_OUT1 before usage. 
Figure 8-3: SYS_CLK System Timing


[Timing diagrams not reproduced: the relationship of CPU CLK and SYS_CLK_OUT1, followed by
diagrams showing SYS_CLK_OUT1, CPU CLK, ADDRESS/CMD Out, CACK/DACK, and DATA In.]
8.4.2 Reference Clocks


applied to the REF_CLK_IN pin. Phase locking is accomplished by* 
stall for one phase whenever the rising edge of REF_CLK_IN is detected iG: 
the rising edge of the internal CPU CLK that causes a rising eof SYS. 
all timing is specified with respect to the rising edge of RE 


occured just before 
_OUT1_H. Note that 


Table 8-5: Systems Using REF_CLK 

























DATA_BUS_REQ, input setup 1.1nS Tdsu
DATA<127:0>, ADDR<39:4>
DATA_BUS_REQ, input hold 0.5 x Tcycle Tsdadh
DATA<127:0>, ADDR<39:4>
ADDR<39:4> output delay Traod 1
ADDR<39:4> output hold time Traoh
DATA<127:0> output delay [...] 1.5 x Tcycle + 0.9nS Trdod 1,2
DATA<127:0> output hold time Trdoh
Non-Turbo Mode 
ADDR_BUS_REQ Tntrabrsu 
ADDR_BUS_REQ Tntrabrh 
CACK, DACK Tntracksu 
CACK, DACK Tntrackh 
Turbo Mode 
ADDR_BUS_REQ:4 ; Ttracksu 3 
CACK, DACK & : 
0.5 x Tcycle Ttrackh 3


ccounts for on chip skews which include - 0.4nS for driver mismatch and clock 
‘phase detector skews due to circuit mismatch (0.2nS) and delay in REF_CLK_IN 
package (0.3nS). 


2. For write transactions initiated by DECchip 21164-AA, data is driven 1 CPU cycle later.
3. In Turbo Mode, control pins are piped on-chip for one SYS_CLOCK_OUT1 before usage.
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Ae internal CPU 
y than the external 
ally the gain will 


4905T1, a relationship be- 
LK_OUT1 internally to 





[Timing diagram not reproduced: CPU CLK, REF_CLK_IN, and SYS_CLK_OUT1.]
8.4.3 Timing - Other Pins


Table 8-6: Asynchronous Input Pins 
REF_CLK_IN_H SYS_RESET_L 
CLK_MODE_H DC_OK_H 
SYS_MCH_CHK_IRQ_H PWR_FAIL_IRQ_H


















Spec 


IRQ_H hold time 
from deassertion of SYS_RESET_L 


Table 8-7: Timing for SYS_CLK Ratio Programming 












Table 8-8: Other Pin List


Input Only
CFAIL_H
FILL_ERROR_H
IDLE_BC_H
TMS_H
SROM_PRESENT_L
SYSTEM_LOCK_FLAG_H
TDI_H
TRST_L


Output Only
VICTIM_PENDING_H
SCACHE_SET_H
DATA_RAM_OE_H
SYS_CLK_OUT2
SROM_OE_L
INT4_VALID_H
TAG_RAM_WE_H
TDO_H
CPU_CLK_OUT_H


Bi-directional
ADDR_CMD_PAR_H
TAG_DATA_PAR_H
TAG_DIRTY_H
DATA_CHECK_H
TAG_VALID_H
TAG_CTL_PAR_H
Table 8-9: Other Pin Input Timing (SYS_CLK or REF_CLK based


input setup 
input hold 


Table 8-10: Other Pin List - Output Pin Groupings


Group A
CPU_CLK_OUT_H SYS_CLK_OUT2
SROM_OE_L TDO_H

Group B
ADDR_CMD_PAR_H ADDR_RES_H
SCACHE_SET_H VICTIM_PENDING_H

Group C
DATA_CHECK_H INT4_VALID_H

Group D
DATA_RAM_OE_H
TAG_DATA_H
TAG_RAM_OE_H
TAG_VALID_H





Table 8-11: Group Output Timing - SYS_CLK based systems


Group A
output delay Tdd
output hold time Tmdd

Group B
output delay Taod
output hold time Taoh

Group C
output delay Tdod
output hold time Tdoh


‘tvansactions) 
Table 8-12: Group Output Timing - REF_CLK based systems


Group A 


output delay Tdd 
output hold time Tmdd 
Group B 

output delay Traod 
output hold time Traoh 
Group C 


output delay 
output hold time 

Group D (non-BCache transaction) 
output delay 

output hold time 

Group D (during BCache transaction) 
output delay 
output hold time 
8.5 Clock Test Modes


8.5.1 Normal Mode 


When the CLK_MODE_H<1:0> pins are not asserted the OSC_CLK_IN frequency is divided by 2.
This is the normal operational mode of the clock circuitry.


8.5.2 Chip Test Mode 


In order to lower the maximum frequency required to be supplied by the tester, a divide by 1
mode has been designed into the clock generator circuitry. When the CLK_MODE_H<0> pin is
asserted and CLK_MODE_H<1> is not asserted, the frequency applied to the input clock
pins OSC_CLK_IN_H,_L bypasses the clock divider [...] to the Chip Clock driver. This
allows the chip internal circuitry to be tested [...] 1/2 frequency (294 MHz) OSC_CLK_IN.


8.5.3 Module Test Mode 


CLK_MODE_H<1> is asserted, the clock 
=IN_H,_L is divided by 4 and is sent to the 
ip (DPLL) continues to keep the on chip SYS_ 
rmal limits if a REF_CLK_IN signal is applied 
_IN). 


Chip Clock driver. The Digital 
CLK_OUT1 locked to REF_CLK: 


8.5.4 Clock Test Reset


When both the CLK_] 
CLK generator circus 


and the CLK_MODE_H<1> pin are asserted, The SYS_ 
t to a known state. This allows the tester to synchronize 


{ MODE_H<0> CLK_MODE_H<I> 
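

The four clock test modes described in Sections 8.5.1 through 8.5.4 can be summarized as a decode of the two CLK_MODE_H pins. The following is a minimal sketch; the function names and mode labels are hypothetical, and the module test entry assumes CLK_MODE_H<0> is deasserted in that mode, which is inferred from the fact that asserting both pins selects the clock test reset.

```python
def decode_clock_mode(clk_mode_h0: bool, clk_mode_h1: bool) -> tuple:
    """Map the CLK_MODE_H<1:0> pins to (mode name, OSC_CLK_IN divisor).

    The divisor is None in the reset mode, where the SYS_CLK generator
    is forced to a known state rather than producing a divided clock.
    """
    if clk_mode_h0 and clk_mode_h1:
        return ("clock test reset", None)
    if clk_mode_h0:
        return ("chip test", 1)      # divider bypassed
    if clk_mode_h1:
        return ("module test", 4)
    return ("normal", 2)


def internal_clock_mhz(osc_mhz: float, clk_mode_h0: bool, clk_mode_h1: bool):
    """Return the internal clock frequency, or None in the reset mode."""
    _, divisor = decode_clock_mode(clk_mode_h0, clk_mode_h1)
    return None if divisor is None else osc_mhz / divisor
```

For instance, a 588 MHz oscillator in normal mode and a 294 MHz tester clock in chip test mode both give a 294 MHz internal clock, consistent with the 1/2 frequency figure quoted for chip test mode.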
8.6 Test Configuration


Figure 8-5: Test Configuration 


Package Pin 
8.7 Revision History


Table 8-14: Revision History 





Andy Olesin Oct. 18, 1993 
Andy Olesin July 27,1993 changed test res 
Andy Olesin May 27,1993 Refined the AC’ 

Anil Jain April 05, 1993 
Chapter 9


Pinout 










9.1 DECchip 21164-AA Pinout Overview ; 


The DECchip 21164-AA chip is contained in the | 
signals, the remaining pins are used for power and gré 
the crude pinout of DECchip 21164-AA will look like this: 


Age. 289 of these pins are used for 
ooking down at the top of the chip 


Figure 9-1: DECchip 21164-AA Pinout 





ADDR<39:4> 
CMD<3:0> 
Clocks 
TRO<5:0> 

SROM Interface 


DATA<63:0> | DECchip 21164- 
CHECK<7:0> | 


| 
| 
| 
| 
+ 


are'tutput only, those listed as "I" are input only, and those listed as "B" 


are bidirecti tri-statable. 
Table 9-1: Clock Pins
Signal Name         Function
OSC_CLK_IN_H,L      I  CPU clock input          2
CPU_CLK_OUT_H       O  CPU clock output         1
SYS_CLK_OUT1_H,L    O  System clock output      2
SYS_CLK_OUT2_H,L    O  System clock output      2
REF_CLK_IN_H        I  System clock input       1
CLK_MODE_H<1:0>     I  Clock logic mode select  2
SYS_RESET_L         I  Reset                    1










Section Total 


1This input may be driven asynchronously; an internal synchro 
Table 9-2: System Interface Pins




























ADDR_H<39:4>          B  Address bus
CMD_H<3:0>            B  Command bus  4
ADDR_CMD_PAR_H        B  Odd parity for address and CMD  1
VICTIM_PENDING_H      O  This miss produced a victim  1
ADDR_BUS_REQ_H        I  System wants to use the address and command bus  1
CACK_H                I  DECchip 21164-AA command taken  must be deasserted  1
CFAIL_H               I  must be deasserted  1
FILL_ERROR_H          I  must be deasserted  1
ADDR_RES_H<1:0>       O  NOP  2
INT4_VALID_H<3:0>     O  unspecified  4
SCACHE_SET_H<1:0>     O  unspecified  2
FILL_H                I  must be deasserted  1
FILL_ID_H             I  should be deasserted  1
DACK_H                I  must be deasserted  1
FILL_NOCHECK_H        I  should be deasserted  1
SYSTEM_LOCK_FLAG_H       State of the lock flag  should be deasserted  1
IDLE_BC_H                [...] CPU accesses to the [...]  must be deasserted  1
DATA_BUS_REQ_H           System wants to use the data [...]  1


Section Total 
Table 9-3: Bcache Pins





Function 




















INDEX_H<25:4>            Bcache index
DATA_H<127:0>         B  Data Bus  128
DATA_CHECK_H<15:0>    B  INT8 ECC check bits or byte parity  16
TAG_DATA_H<38:20>     B  B-cache tag (1MB min)  19
TAG_DATA_PAR_H        B  Tag parity  1
TAG_VALID_H           B  Tag valid  1
TAG_SHARED_H          B  Tag shared  1
TAG_DIRTY_H           B  Tag dirty  1
TAG_CTL_PAR_H         B  1
TAG_RAM_OE_H          O  asserted  1
TAG_RAM_WE_H          O  deasserted  1
DATA_RAM_OE_H         O  asserted  1
DATA_RAM_WE_H         O  deasserted  1





Section Total 
Table 9-4: Interrupt and Misc. Pins


Function 





IRQ_H<3:0> Interrupt requests
SYS_MCH_CHK_IRQ_H I System machine check interrupt
PWR_FAIL_IRQ_H I Power failure interrupt
MCH_HLT_IRQ_H I Halt request
PORT_MODE_H<1:0> r Test port mode 
TDI_H I IEEE 1149.1 Serial Data Input
TDO_H 0 
TMS_H I
I 
I 
0 


delay input? 


TCK_H 

TRST_L 
TEST_STATUS_H<1:0> 
SROM_PRESENT_L 
SROM_OE_L 
SROM_CLK_H 
SROM_DAT_H 
DC_OK_H 
PERF_MON_H 
TEMP_SENSE 


should be asserted? 
deasserted 
deasserted* 


deasserted4 


a 


Section Total 








1This(These) input(s) m: 
2Input for SYS_CLK_O 


3TRST_L can be asser 
functions in test modes. * 


4If PORT_MODE 


ously; an internal synchronizer is implemented. 
ive to SYS_CLK_OUT1_H,L. 
ration to ensure the IEEE 1149.1 port remains inactive. The pin has special 





utput is unspecified. 
9.3 Revision History


Table 9-5: Revision History 






Who When Description of chan: 


3/22/92 New pinout 
Pete Bannon 4/22/92 Change asserti 
Pete Bannon 10/22/92 Update test pins: 
JHE 4-DEC-1992 Add reset informati 
JHE 25-OCT-1993 _ edits: clk Mibde, ref _cl 












Pete Bannon 


vref_h,ecl_out_h,srom_present,por 
Chapter 10


The Package 


TBS 
This chapter is To Be Supplied. 
10.1 Revision History 
Table 10-1: Revision History 
Who When Description of change 
Chapter 11


Test Interface and Testability Features 


11.1 Introduction 


The DECchip 21164-AA CPU chip’s testability features address broad issues of providing
cost-effective and thorough testing of DECchip 21164-AA through its life cycle. Some specific goals
supported by DECchip 21164-AA testability features include:

• Chip debug.
• Efficient and thorough testing of embedded RAM arrays.
• Built-in Self Repair (BiSr) of instruction cache (ICache) and support for reduced probe test
  for efficient and low cost wafer probe testing.
• High fault coverage chip manufacturing test.
• Effective burn-in test.
• Module assembly verification test via IEEE 1149.1 architecture.
• Automatic power-on Built-in Self-test (BiSt) of the ICache.
• Limited support for concurrent fault detection in fault tolerant systems that employ duplicate
  DECchip 21164-AAs.


The testability features included on DECchip 21164-AA include ICache self-test and self-repair,
internal Linear Feedback Shift Registers (LFSRs) and scan observability registers, support for
reduced probe count wafer probe test, IEEE 1149.1 test access port and boundary scan register,
and several other test features. DECchip 21164-AA also includes a comprehensive test interface
port that permits efficient access to the chip’s testability and diagnosability features during debug
and manufacturing testing phases.


11.2 Test Port 


The test interface port on DECchip 21164-AA consists of 13 dedicated pins that support three port
interface modes: 1) Normal mode, 2) Manufacturing test mode, and 3) Debug test mode. Table 11-1
summarizes the test port pins and their functions in the three modes.
Table 11-1: DECchip 21164-AA Test Port Pins and Port Modes

                        Normal Function       Manufacturing                     Debug
Pin Name           Typ  Signal           Typ  Signal                       Typ  Signal           Typ
PORT_MODE_H<1>     I    LOW              I    LOW                          I    HIGH             I
PORT_MODE_H<0>     B    LOW              I    HIGH                         I    dbg_data_h<8>    O
SROM_PRESENT_L     B    srom_present_l   I    test control                 I    dbg_data_h<7>    O
SROM_DATA_H        I    srom_data_h/Rx   I    srom_data_h                  I    srom_data_h/Rx   I
SROM_CLK_H         O    srom_clk_h/Tx    O    obs_data_h<8>                O    srom_clk_h/Tx    O
SROM_OE_L          O    srom_oe_l        O    obs_data_h<7>                O    srom_oe_l        O
TDI_H              B    tdi_h            I    obs_data_h<6>                O    dbg_data_h<6>    O
TDO_H              O    tdo_h            O    obs_data_h<5>                O    dbg_data_h<5>    O
TMS_H              B    tms_h            I    obs_data_h<4>                O    dbg_data_h<4>    O
TCK_H              B    tck_h            I    obs_data_h<3>                O    dbg_data_h<3>    O
TRST_L             B    trst_l           I    obs_data_h<2>                O    dbg_data_h<2>    O
TEST_STATUS_H<0>   O    test status      O    test status / obs_data_h<1>  O    dbg_data_h<1>    O
TEST_STATUS_H<1>   O    test status      O    test status / obs_data_h<0>  O    dbg_data_h<0>    O



11.2.1. Normal Test Interface Mode 


The test port is in normal test interface mode when the PORT_MODE_H<1:0> are tied to 00. 
This is the default mode. In this mode the test port supports a serial ROM interface, a serial 
diagnostic terminal interface, and an IEEE 1149.1 test access port. 


11.2.1.1. SROM Port 


SROM_PRESENT_L, SROM_DATA_H, SROM_OE_L, and SROM_CLK_H constitute the SROM interface.

If serial ROMs (such as an AMD Am17386) are present in the system, the SROM_PRESENT_L pin
may be pulled down on the board. DECchip 21164-AA samples this pin during system reset.
If the pin is pulled down during system reset, the DECchip 21164-AA reset sequence
automatically loads the ICache from the serial ROMs before executing the first instruction.
If SROM_PRESENT_L is pulled up during system reset, the SROM load is disabled. In this case
the ICache valid bits are cleared by the reset sequence, causing the first instruction fetch to miss
the ICache and fetch the instructions from off-chip memory.


During SROM load:

• The SROM_OE_L signal supplies the output enable to the serial ROM, serving both as an output
  enable and as a reset (refer to the serial ROM specifications for details).

  DECchip 21164-AA asserts this signal low for the duration of the ICache load from the serial
  ROM. Once the load is complete, the signal is deasserted.

• The SROM_CLK_H output signal supplies the clock to the ROM that causes it to advance to the
  next bit. The cycle time of this clock is 128 times the CPU clock cycle time.

• The SROM_DATA_H pin reads the serial ROM data.


The serial ROMs can contain enough Alpha code to complete the configuration of the external
interface (e.g., setting the timing on the external cache RAMs) and to diagnose the path between
the CPU chip and the real ROM.


DECchip 21164-AA is in PALmode following the deassertion of system reset and the conclusion
of ICache self-test. This gives the code loaded into the ICache access to all of the visible state
within the chip.


See Section 11.4 for the details of the ICache fill operation from SROMs. 


11.2.1.2 Serial Terminal Port 


Once the data in the serial ROM has been loaded into the ICache, the three SROM port pins
become simple parallel I/O pins that can be used to drive a diagnostic terminal such as an RS422
terminal.


When the serial ROM is not being read, the SROM_OE_L output signal is deasserted. The serial
diagnostic terminal port is enabled if this pin is wired to the active high enable of an RS422
(or 26LS32) receiver driving onto SROM_DATA_H and to the active high enable of an RS422 (or
26LS31) driver driven from the SROM_CLK_H pin. The CPU allows SROM_DATA_H to be read and
SROM_CLK_H to be written by PALcode. This supports a bit-banged serial interface.
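
The transmit side of such a bit-banged interface can be sketched as follows. This is an illustration only: the 8N1 framing (one start bit, eight data bits sent LSB first, one stop bit), the idle-high line level, and the `write_pin` callback that stands in for the PALcode IPR write driving SROM_CLK_H are all assumptions and are not taken from this specification.

```python
from typing import Callable, List


def frame_8n1(byte: int) -> List[int]:
    """Return the line levels for one byte: start bit, 8 data bits (LSB first), stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]  # start bit low, stop bit high (assumed framing)


def send_byte(byte: int, write_pin: Callable[[int], None]) -> None:
    """Drive each level of the frame onto the output pin, one call per bit time.

    write_pin is a hypothetical hook; real PALcode would also have to
    time each bit period in software.
    """
    for level in frame_8n1(byte):
        write_pin(level)
```

In this sketch the software owns both the framing and the bit timing, which is what "bit-banged" implies: the chip provides only a readable input pin and a writable output pin.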


IPRs associated with this interface are described in the chapter on PAL Code and IPRs. 


11.2.1.3 IEEE 1149.1 Test Access Port 


TDI_H, TDO_H, TCK_H, TMS_H and TRST_L make up the IEEE 1149.1 test access port. This
port accesses the DECchip 21164-AA boundary scan register and chip tri-state functions for
board-level manufacturing test. The port also allows access to the die identification code. The
port is compliant with all requirements of the IEEE 1149.1 test access port. See IEEE Std. 1149.1,
"A Test Access Port and Boundary Scan Architecture", for the full description of the specification.


Figure 11-1 shows the user-visible features of this port.

TAP Controller


The TAP Controller contains a state machine. It interprets IEEE 1149.1 protocols received on 
TMS_H signal and generates appropriate clocks and control signals for the testability features 
under its jurisdiction. 

Bypass Register 


The Bypass Register is a 1-bit shift register. It provides a short single-bit scan path through the 
port (chip). 
Figure 11-1: IEEE 1149.1 Test Access Port

Instruction Register

The Instruction Register (IR) is 3 bits wide. It supports the EXTEST, SAMPLE, BYPASS, HIGHZ and
DIE_ID instructions. Table 11-2 summarizes the instructions and their functions.

During the capture operation, the shift register stage of the IR is loaded with ’001’. This automatic
load feature is useful for testing the integrity of the IEEE 1149.1 scan chain on a module.
Table 11-2: Instruction Register

IR<2:0>   Name      Scan Register Selected    Remarks
111       BYPASS    Bypass Register           Default.
110       HIGHZ     Bypass Register           Tristates all I/O and output pins
101       BYPASS    Bypass Register           Duplicate BYPASS
100       HIGHZ     Bypass Register           Duplicate HIGHZ
011       DIE_ID    Die ID Register
010       SAMPLE    Boundary Scan Register
001       DIE_ID    Die ID Register           Duplicate DIE_ID
000       EXTEST    Boundary Scan Register    BSR drives chip I/O and output pins


Note that the SAMPLE, BYPASS and DIE_ID instructions are non-intrusive. That is, they can
be executed while the chip is performing its normal functions. The EXTEST and HIGHZ instructions
force the chip’s internal logic to a reset state.
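
The instruction decoding in Table 11-2 can be captured in a small lookup, shown here as a sketch for test software that needs to reason about IR values. The encodings and register names are transcribed from the table; the function name `decode_ir` and the tuple layout are illustrative choices, not part of the specification.

```python
# IR<2:0> encodings transcribed from Table 11-2.
IR_DECODE = {
    0b111: ("BYPASS", "Bypass Register"),
    0b110: ("HIGHZ", "Bypass Register"),
    0b101: ("BYPASS", "Bypass Register"),
    0b100: ("HIGHZ", "Bypass Register"),
    0b011: ("DIE_ID", "Die ID Register"),
    0b010: ("SAMPLE", "Boundary Scan Register"),
    0b001: ("DIE_ID", "Die ID Register"),
    0b000: ("EXTEST", "Boundary Scan Register"),
}

# Instructions that force the chip's internal logic to a reset state.
INTRUSIVE = {"EXTEST", "HIGHZ"}


def decode_ir(ir: int):
    """Return (instruction name, selected scan register, is_intrusive) for a 3-bit IR value."""
    name, register = IR_DECODE[ir & 0b111]
    return name, register, name in INTRUSIVE
```

For example, `decode_ir(0b010)` reports SAMPLE on the Boundary Scan Register as non-intrusive, while `decode_ir(0b000)` reports EXTEST as intrusive.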


Die-ID Register 


The Die-ID Register is a 32-bit scan register. It shifts out fuse-programmed die information. The
format and content of the information to be programmed will be determined by manufacturing.


Boundary Scan Register


The Boundary Scan Register on DECchip 21164-AA is approximately 286 (TBD) bits long. It supports
the SAMPLE and EXTEST instructions. See Section 11.9 for the organization of this register.


Effects of EXTEST and HIGHZ Instructions

The effects of the EXTEST and HIGHZ instructions on the DECchip 21164-AA chip are as follows:

• The EXTEST instruction gives the boundary scan register complete control over the
  output and bidirectional pins. The HIGHZ instruction forces all output and bidirectional pins
  to a high impedance state.

• The effect on clock input and output pins is TBD.

• The internal chip logic is forced to a reset state. This prevents the CPU from reacting to
  irrelevant test data that may appear at the chip’s inputs.


11.2.1.4 Test Status Pins 


Two test status pins, TEST_STATUS_H<1:0>, are used for extracting test status information
from the chip. System reset drives both test status pins low.

• During ICache BiSt

  TEST_STATUS_H<0> is asserted high to indicate that the ICache BiSt has failed.
  TEST_STATUS_H<1> is asserted high to indicate the presence of more than two failing rows.

  The start of ICache BiSt forces the TEST_STATUS_H<0> pin high. If the ICache BiSt
  passes, TEST_STATUS_H<0> is deasserted; otherwise it remains asserted. TEST_STATUS_H<1>
  is asserted high as soon as a third bad ICache row is detected. This may be used to
  detect an unrepairable ICache early, thus reducing average test time. System users may ignore
  this pin.

• During on-line LFSR mode

  When the internal LFSRs are turned on in on-line mode (ON_OBL_1 command, described
  later), TEST_STATUS_H<0> outputs the quotient generated by the observability LFSRs.
  A new quotient bit is observed with every system clock rising edge. This feature is useful to
  people implementing fault-tolerant systems. The feature can also be exploited in burn-in
  and life test for monitoring failures. See Section 11.6.2 for more details.

• IPR reads/writes to test status pins

  PALcode can write to TEST_STATUS_H<1> and can read TEST_STATUS_H<0> via hardware
  IPR access. See Chapter 3.

The default operation for the TEST_STATUS_H<0> pin is to output the BiSt result. The default
operation for the TEST_STATUS_H<1> pin is to output the IPR written value.


11.2.2 Manufacturing Test Interface Mode 


The DECchip 21164-AA test port is in Manufacturing Test Interface Mode when
PORT_MODE_H<1:0> are tied to 01 (binary). This mode allows control of ICache test features,
internal LFSR and Scan Observability Registers, and efficient byte-serial read-out of observability
features, including the ICache bit map. Figure 11-2 shows the user-visible features during
manufacturing test interface mode.

The SROM_PRESENT_L pin is used for test control. Asserting a high on this pin initiates a
test operation state. In this state, the DECchip 21164-AA chip automatically loads the 8-bit Test
Command Register and executes all required test actions, including any additional shift
operations. Input test data is fed serially at the SROM_DATA_H pin. Test results from the chip are
shifted out byte-serially (9 bits at a time) on the test pins.

The SROM_PRESENT_L pin may be returned to low once the test shift operation has been initiated.
A new test command may be loaded by once again asserting a high on the SROM_PRESENT_L pin
after all actions of the previous command have been completed.

When the manufacturing test interface mode is activated, all inputs to the IEEE 1149.1 port are
driven with their default values.

Test Command Register (TCR)

The Test Command Register is 8 bits wide. Table 11-3 summarizes the test commands and their
actions.

Figure 11-2: Manufacturing Test Interface Mode
Table 11-3: Test Command Register

TCR<7:0>      Command Mnemonic   Action

00 000 XXX    RD_ICache          Reads out ICache contents on test port. Useful for debug/bit
                                 mapping, etc.

00 001 00X    WR_IC_F0           Writes ICache serially. Data shifts at system clock rate. Internal
                                 chip reset extended. Used for subsequent read-out for ICache test
                                 purposes.

00 001 01X    WR_IC_F1           Writes ICache serially. Data shifts at system clock rate. Internal
                                 chip reset NOT extended. Used for speedier ICache fill during
                                 manufacturing.

00 001 10X    WR_IC_S0           Writes ICache serially. Data shifts at slow rate (cycle time = cpu
                                 clock cycle * 128). Internal chip reset extended.

00 001 11X    WR_IC_S1           Writes ICache serially. Data shifts at slow rate (cycle time = cpu
                                 clock cycle * 128). Internal chip reset not extended. This
                                 instruction is forced by the CPU during the power-on/reset sequence
                                 to automatically load from SROM.

00 100 XXX    LD_BKG             Loads ICache fill scan path with background pattern. This
                                 instruction is forced by the BiSt logic.

00 101 XXX    SC_FRCAM           Scans out Failing Row CAM on test port. This instruction is forced
                                 by the BiSt logic.

00 110 XXX    SC_BIST            Scans out portions of BiSt logic for testing the BiSt logic.

00 010 XXX    RU_BIST            Runs ICache BiSt. This instruction is forced by the power-on/reset
                                 sequence.

00 011 XXX    RU_RETENT          Runs ICache Retention BiSt.

00 111 XXX    IC_NOP             No ICache action. However, forces internal chip reset.

01 ss dddd    SC_src_delay       Scans out selected register. src = 0X selects LFSR scan path. src =
                                 1X selects internal scan register. delay selects cycle (0 to 15) to
                                 be captured for observation. The command performs the capture-scan
                                 out sequence continuously, until another test command is loaded.

10 X0 0XXX    OFF_OBL            Turns off Observability LFSR data compression mode.

10 X0 10XX    ON_OBL_0           Turns on the Observability LFSR data compression in off-line mode.

10 X0 11XX    ON_OBL_1           Turns on the Observability LFSR data compression in on-line mode.

10 X1 0XXX    OFF_CBL            Turns off (if previously turned on) the intrusive controllability
                                 features, LFSR pattern generators, etc. on the chip.

10 X1 1XXX    ON_CBL             Turns on the intrusive controllability features (such as LFSR
                                 pattern generators) on the chip.

11 XX XXXX                       Reserved.

11 11 1111    PRT_NOP            No test port actions.
Notes:

1. The internal chip logic is forced to reset during all ICache test commands.

2. The cycle time for shifting data during SROM load is 128 * cpu clock cycle. Assuming the
   fastest DECchip 21164-AA cycle time of 3 ns, this translates to a shortest shift cycle time of
   384 ns.

3. The scan and LFSR observability registers can be operated and read out without interfering
   with normal system operation.
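
The command encodings in Table 11-3 can be decoded mechanically. The sketch below is one way tester software might do so; it assumes that the leftmost bit in each table pattern is TCR<7>, and the function name `decode_tcr` and the returned strings are illustrative rather than defined by this specification.

```python
def decode_tcr(tcr: int) -> str:
    """Map an 8-bit Test Command Register value to its Table 11-3 mnemonic."""
    tcr &= 0xFF
    group = tcr >> 6                       # TCR<7:6> selects the command group
    if group == 0b00:                      # ICache commands, selected by TCR<5:3>
        op = (tcr >> 3) & 0b111
        if op == 0b001:                    # serial write; TCR<2:1> picks the variant
            variants = ("WR_IC_F0", "WR_IC_F1", "WR_IC_S0", "WR_IC_S1")
            return variants[(tcr >> 1) & 0b11]
        return {
            0b000: "RD_ICache", 0b010: "RU_BIST", 0b011: "RU_RETENT",
            0b100: "LD_BKG", 0b101: "SC_FRCAM", 0b110: "SC_BIST",
            0b111: "IC_NOP",
        }[op]
    if group == 0b01:                      # scan out: TCR<5:4> = src, TCR<3:0> = delay
        src, delay = (tcr >> 4) & 0b11, tcr & 0xF
        return f"SC_src_delay(src={src:02b}, delay={delay})"
    if group == 0b10:                      # TCR<5> is a don't-care in this group
        if tcr & 0b0001_0000:              # TCR<4> = 1: controllability features
            return "ON_CBL" if tcr & 0b1000 else "OFF_CBL"
        if not tcr & 0b1000:               # TCR<3> = 0: observability LFSRs off
            return "OFF_OBL"
        return "ON_OBL_1" if tcr & 0b0100 else "ON_OBL_0"
    return "PRT_NOP" if tcr == 0xFF else "Reserved"
```

Note that the all-ones pattern is checked before the general `11 XX XXXX` case, since PRT_NOP is a specific value inside the otherwise reserved group.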


ICache Fill Scan Register

This is a 200-bit scan register used for filling the ICache serially from SROMs or a tester. See
Section 11.4 for the details of the serial fill operation.

ICache Read Scan Register

This is a 100-bit read scan path used for dumping the ICache contents. See Section 11.4 for
the details of the serial read operation.

Observability LFSRs

This is a TBD-bit register used for enhancing the fault coverage of manufacturing test. See
Section 11.6 for details.

Observability Scan Registers

This is a TBD-bit register used primarily for chip debug. See Section 11.7 for details.

Controllability Features

The test port also has provision for supporting internal controllability features. If these features
are provided, they are turned on and off via the ON_CBL and OFF_CBL test commands.

FRCAM Scan Register

This scan register is 13 bits long. It consists of 12 bits of failing row CAM and the unrepairable_ic
flag. See Section 11.3 for more details.

Port Observability Register

This is a 9-bit serial-in, parallel-out observability register. The parallel outputs of the register
update the corresponding test port pins every system clock cycle. This allows the tester to observe
9 bits of scan data simultaneously. This reduces the vector depth requirement on the chip tester’s
failure capture memory (DFM) by a factor of eight.
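
The effect of the serial-in, parallel-out register on tester vector depth can be illustrated with a short sketch that groups a serial scan stream into 9-bit words, one word per tester vector. The bit ordering within a word (earliest bit as most significant) and the zero padding of a trailing partial word are assumptions made for the example; the specification does not state how scan bits map onto the individual obs_data_h pins.

```python
from typing import Iterable, List


def pack_port_words(bits: Iterable[int], width: int = 9) -> List[int]:
    """Group a serial bit stream into words of `width` bits, one word per tester vector."""
    words: List[int] = []
    word, count = 0, 0
    for bit in bits:
        word = (word << 1) | (bit & 1)   # earliest bit becomes the MSB (assumed ordering)
        count += 1
        if count == width:
            words.append(word)
            word, count = 0, 0
    if count:                            # pad a trailing partial word with zeros
        words.append(word << (width - count))
    return words
```

Under these assumptions a 100-bit scan path occupies 12 tester vectors instead of 100.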


The internal observability LFSRs and the Internal Scan Registers shift at the chip’s internal 
clock rate. The scan paths in ICache test logic shift at the system clock rate. 


11.2.3 Debug Test Interface Mode 


The DECchip 21164-AA test port is in Debug Test Interface Mode when PORT_MODE_H<1> is tied
to 1. Debug Test Interface Mode allows critical chip nodes to be monitored in parallel.

Signals to be observed on the parallel port are selected by TBD IPR bits. (See the chapter on
PALcode and IPRs for the details.)

Restrictions of parallel debug test port

1. When the parallel debug port is activated, all inputs corresponding to the normal test input
   pins are fed with their default values.

2. The PORT_MODE_H<1> pin allows switching back and forth between the normal test port
   and the parallel debug port.

3. The parallel debug port is designed to support chip/system debugging in prototype system
   environments only. Some small logic may be required to ensure that there is no
   interference with other chips connected to the test port.


11.2.4 Activating Debug/Manufacturing Port Modes in a System 


Both Debug and Manufacturing port modes can be activated in a system by incorporating a few
jumpers and, if necessary, some support logic. Jumpers are required because some of the test pins
are shared for outputting the debug/observability information from the chip. Jumpers prevent
observability data from interfering with the operation of the other chips connected to the shared
test pins. Support logic is required only if the system needs to load test commands automatically
through the manufacturing test port mode, for example, to turn the observability LFSRs on or off
in on-line mode.

Figure 11-3 shows a typical module and the places where jumpers may be necessary to activate
the debug and manufacturing test port modes.


11.3 ICache BiSt

The DECchip 21164-AA ICache is tested by a Built-in Self-test that implements a full march
algorithm. The self-test logic covers all three ICache arrays (Data, BHT, and TAG).

ICache BiSt is invoked automatically upon deassertion of system reset unless the BiSt is bypassed.
BiSt is bypassed if the PORT_MODE_H<1> pin is asserted high during system reset.

The BiSt bypass feature allows ICache BiSt as well as the Built-in Self-Repair to be bypassed during
debug and between pattern runs on testers, if so desired.

BiSt runs for TBD cycles.

The Go/NoGo result of BiSt is made available on the TEST_STATUS_H<0> pin. TEST_STATUS_H<0>
is forced low by system reset and high at the start of the BiSt. If, at the end of
BiSt, any of the ICache rows are bad, the pin remains asserted high; otherwise it is deasserted.
Software can read this status through an IPR. If the ICache fails in more than two rows,
TEST_STATUS_H<1> is asserted high. This pin is cleared by system reset.
Figure 11-3: Test Port Connections on Module

Built-in Self-Repair

When the BiSt is invoked on wafers that have not gone through the fuse repair process, the
ICache BiSt sequence automatically performs the following steps:

• Perform the BiSt. Store up to two failing row addresses in the failing row CAMs.

• Self-repair the ICache data array.

• Repeat BiSt.

• Dump the content of the failing row CAMs on the test port.

The repair information shifted out consists of the following bits.
Table 11-4: FRCAM Scan Register Organization

Field Name             Extent   Remarks
unrepairable_ic_flag   0        High = unrepairable cache
CAM_0 Valid Flag       1        High = 1st repair address valid
CAM_0 Reg              2:6      1st repairable row-pair address
CAM_1 Valid Flag       7        High = 2nd repair address valid
CAM_1 Reg              8:12     2nd repairable row-pair address

Notes:

• The automatic BiSt and BiSR run identically under the normal and the manufacturing test
  interface modes.

• The built-in self-repair feature is available only prior to the laser repair process. BiSt logic
  uses a fuse-programmed internal signal to determine whether BiSR is required.
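
A tester program could unpack the 13-bit FRCAM dump along the lines of the following sketch. The field positions follow Table 11-4. The assumptions are that `bits[i]` holds scan position i, and that within each 5-bit address field the lowest-numbered position is the most significant bit; the specification does not state the bit order inside a field.

```python
from typing import Dict, List


def decode_frcam(bits: List[int]) -> Dict[str, object]:
    """Decode a 13-bit FRCAM scan dump laid out as in Table 11-4."""
    if len(bits) != 13:
        raise ValueError("FRCAM scan register is 13 bits long")

    def field(lo: int, hi: int) -> int:
        # Assemble positions lo..hi into an integer, lowest position as MSB (assumed).
        value = 0
        for bit in bits[lo:hi + 1]:
            value = (value << 1) | (bit & 1)
        return value

    rows = []
    if bits[1]:                      # CAM_0 Valid Flag
        rows.append(field(2, 6))     # CAM_0 Reg
    if bits[7]:                      # CAM_1 Valid Flag
        rows.append(field(8, 12))    # CAM_1 Reg
    return {"unrepairable": bool(bits[0]), "repair_row_pairs": rows}
```

A dump with only the CAM_0 valid flag set yields a single repair address, and a dump with neither flag set yields an empty list.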


11.4 ICache Serial Write and Read Operations 


Serial Write Operation 


The ICache can be written serially from the SROM or, for testing purposes, from the SROM
port pins. On DECchip 21164-AA, all ICache bits, including each block’s tag, ASN, ASM, valid
and branch history bits, can be loaded serially. Once the serial load has been invoked (either
automatically by the chip reset sequence, or via the IC_WR_xx command from the manufacturing
test port), the entire cache is loaded automatically from the lowest to the highest address.

The serial bits are received in a 200-bit Fill Scan Path, from which they are written in parallel
into the ICache address. The Fill Scan Path is organized as shown in Figure 11-4. The farthest
bit (tag<42>) is shifted in first and the nearest bit (BHT<7>) is shifted in last. Note that the
data and predecode bits in the data array are interleaved.

The automatic serial fill invoked by the chip reset sequence occurs at the slower SROM clock rate
(period = cpu clock period * 128). The serial fill invoked by IC_WR_xx can occur at the SROM rate
or at the system clock rate. In either case, the ICache fill operation is automatic.
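
As a worked example of the slow fill rate, the sketch below computes the SROM shift period and the time needed to shift one 200-bit Fill Scan Path load. The 3 ns CPU cycle time is the fastest cycle time quoted in the notes to Table 11-3; the function names are illustrative, and the result counts shift time only, not any cycles spent writing the row into the ICache.

```python
def srom_shift_period_ns(cpu_cycle_ns: float) -> float:
    """SROM clock period: 128 CPU clock cycles."""
    return 128 * cpu_cycle_ns


def fill_path_shift_time_us(cpu_cycle_ns: float, path_bits: int = 200) -> float:
    """Time to shift one full Fill Scan Path load at the SROM rate, in microseconds."""
    return path_bits * srom_shift_period_ns(cpu_cycle_ns) / 1000.0
```

At a 3 ns CPU cycle the shift period is 384 ns, so one 200-bit load takes about 76.8 microseconds of shifting.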


Serial Read Operation 


All three ICache arrays can be read out serially for testing purposes. The manufacturing port’s
IC_RD command initiates the serial read operation. Arrays are dumped from the lowest to the
highest address. The data is first received into a Read Scan Path (RSP), from which it is serially
shifted to the test port’s Port Observability Register at the system clock rate. The data can be
read out at the test port pins 9 bits at a time.
Figure 11-4: SROM Fill Scan Path Bit Order

SROM_DATA_H serial input ->
BHT Array          7 -> 6 -> ... -> 0 ->
Data               127 -> 95 -> 126 -> 94 -> ... -> 96 -> 64 ->
Predecodes         19 -> 14 -> 18 -> 13 -> ... -> 15 -> 10 ->
Data parity        b ->
Predecode parity   b ->
Predecodes         9 -> 4 -> 8 -> 3 -> ... -> 5 -> 0 ->
Data               63 -> 31 -> 62 -> 30 -> ... -> 32 -> 0 ->
Tag Parity         b ->
Tag Valids         0 -> 1 ->
TAG ASM            b ->
TAG ASN            0 -> 1 -> ... -> 7 ->
TAGs               13 -> 14 -> ... -> 42

b = Single bit signal
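
The field widths implied by Figure 11-4 can be checked against the stated 200-bit length of the Fill Scan Path, and the interleaved ordering of the data bits can be generated programmatically. The sketch below does both. The widths are read off the figure, assuming the upper predecode group spans bits 19 to 10; the helper name `interleave` is illustrative.

```python
from typing import List

# (field, width) pairs, listed from the bit nearest SROM_DATA_H to the farthest.
FILL_SCAN_FIELDS = [
    ("BHT<7:0>", 8),
    ("Data<127:64>", 64),
    ("Predecodes<19:10>", 10),
    ("Data parity", 1),
    ("Predecode parity", 1),
    ("Predecodes<9:0>", 10),
    ("Data<63:0>", 64),
    ("Tag parity", 1),
    ("Tag valids<0:1>", 2),
    ("Tag ASM", 1),
    ("Tag ASN<0:7>", 8),
    ("Tag<13:42>", 30),
]


def fill_scan_length() -> int:
    """Total number of bits in the fill scan path."""
    return sum(width for _, width in FILL_SCAN_FIELDS)


def interleave(first_hi: int, second_hi: int, pairs: int) -> List[int]:
    """Bit order of an interleaved group, e.g. 127, 95, 126, 94, ..., 96, 64."""
    order: List[int] = []
    for i in range(pairs):
        order += [first_hi - i, second_hi - i]
    return order
```

For example, `interleave(127, 95, 32)` reproduces the upper data group of the figure and `interleave(9, 4, 5)` the lower predecode group.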


Figure 11-5: Read Scan Path Bit Order

Serial out           serial out <-
BHT array leader     dmy <- err <- rf1 <- rf0 <-
BHT Bits             7 <- 6 <- ... <- 0 <-
Data array leader    dmy <- err <- rf1 <- rf0 <-
Data Bits            37 <- 36 <- ... <- 0 <-
Tag array leader     dmy <- err <- rf1 <- rf0 <-
Tag Parity           b <-
Tag Valids           0 <- 1 <-
TAG ASM              b <-
TAG ASN              0 <- 1 <- ... <- 7 <-
TAGs                 13 <- 14 <- ... <- 42

b       = Single bit signal
dmy     = Dummy bit. Makes the RSP for the array an even bit length
err     = Error bit. Useful for BiSt logic testability
rf1,rf0 = Used by BiSt logic to store reference patterns


The RSP is 100 bits long and consists of three segments: a 12-bit BHT segment, a 42-bit Data array
segment, and a 46-bit Tag array segment. Besides the bits that capture data from the ICache array,
each segment has 4 extra bits used by the BiSt logic.

The 150 bits of data from the data array are read into the 38 bits of the Read Scan Path via a
multiplexer which selects one of four physically adjacent data bits. The entire array is read
by making four passes through the ICache addresses. (Note that this causes the BHT and tag
arrays to be read four times.) Consequently, the data dumped by the serial ICache read
operation must be carefully reconstructed before it is interpreted.

The organization of the bits in the read scan path is shown in Figure 11-5.
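
The segment arithmetic for the Read Scan Path can be checked with a few lines of code. This sketch only restates numbers given in the text (capture widths, four leader bits per segment, 150 data array bits); the names are illustrative, and the pass count is derived here as a consistency check rather than taken from the multiplexer design itself.

```python
import math

RSP_CAPTURE_BITS = {"BHT": 8, "Data": 38, "Tag": 42}  # bits that capture array data
LEADER_BITS = 4                                       # dmy, err, rf1, rf0 per segment
DATA_ARRAY_BITS = 150                                 # data array bits per ICache address


def rsp_length() -> int:
    """Total RSP length: capture bits plus leader bits for each segment."""
    return sum(width + LEADER_BITS for width in RSP_CAPTURE_BITS.values())


def passes_needed() -> int:
    """Passes through the ICache addresses needed to dump the whole data array."""
    return math.ceil(DATA_ARRAY_BITS / RSP_CAPTURE_BITS["Data"])
```

The result agrees with the text: the path is 100 bits long, and four passes are needed because 38 capture bits cover 150 data array bits only after the fourth pass.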


11.5 SCache/DCache Test Features 


See the PALcode and IPR chapter and the cache section in the chapter on DECchip 21164-AA
microarchitecture. Also, see Section 11.7 for a description of the SCache scan chain.

SCache Test and Repair Algorithm

TBD.


11.6 Observability LFSRs (OBLs) 


11.6.1 Organization 


DECchip 21164-AA implements several Observability LFSRs (OBLs) to enhance the fault coverage.
The OBLs are turned on and off by the ON_OBL_x and OFF_OBL test commands described
in Section 11.2.2. LFSRs also operate as ordinary scan registers. They are read out by the
SC_src_delay test command.

All LFSRs in DECchip 21164-AA are accessed from a single scan chain. Figure 11-6 summarizes
the LFSR organization. The details of the signals captured and the LFSR design (feedback taps)
are given in Table 11-5.


Figure 11-6: LFSR Chain Organization

[Figure: a single serial scan chain linking the LFSR groups. The legible portion shows
FBox OBLs -> EBox OBLs -> IBox OBLs -> Serial Out. One segment is 28 bits long; the lengths of
the other segments are TBD.]


Table 11-5: Observability LFSR Organization

LFSR Name: Backup Cache Index Pins
Size: 28 bits
Feedback polynomial: 2200000001 (octal; taps at bits 28 and 25)
Access Chain Number: ....

Bit #   Signal name              Remarks
28      feedback
27:26   p%ev5_sc_set_h<0:1>      Unprobed outputs
25      feedback
24:3    p%bc_index_h<4:25>       Unprobed outputs
2:1     p%ev5_adr_res_h<1:0>     Unprobed outputs
0       available

LFSR Name:
Size:
Feedback polynomial: ....
Access Chain Number: ....

Bit #   Signal name              Remarks
0       TBD                      TBD

(As the design work progresses, more details of LFSR operation will be added here.)


11.6.2 On-line LFSR Operation 


DECchip 21164-AA supports an on-line testing mode via its observability LFSRs. The quotient
bit generated by the observability LFSR in the IBox is brought out to the TEST_STATUS_H<0>
pin when the LFSRs are turned on in on-line mode (ON_OBL_1 test command). Monitoring
this pin and comparing it with the expected serial stream can provide an indication of DECchip
21164-AA health on the fly.

This feature can be exploited by fault-tolerant systems that employ multiple redundant
DECchip 21164-AAs. They can compare TEST_STATUS_H<0> on two or more DECchip
21164-AAs performing identical tasks. The same principle can be extended to other test
applications, such as monitoring failures during burn-in test.

During the on-line test mode, a new quotient bit is observed with every system clock rising edge.
Since the observability LFSRs work at the CPU clock rate, not every quotient bit is observed
on TEST_STATUS_H<0>. This is generally acceptable, since an error in an input to an
LFSR typically produces a multitude of erroneous quotient bits.
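
The way a single input error disturbs the quotient stream can be illustrated with a simplified serial signature register. The sketch below reads the octal feedback polynomial from Table 11-5 as x^28 + x^25 + 1, which is an interpretation of "taps at bits 28 and 25" rather than something the specification states, and it models a single serial input instead of the many parallel signals the real OBLs capture. All names are illustrative.

```python
from typing import Iterable, List


def taps_from_octal(poly_octal: str) -> List[int]:
    """Return the exponents of the nonzero terms of a polynomial given in octal."""
    value = int(poly_octal, 8)
    return [bit for bit in range(value.bit_length() - 1, -1, -1) if (value >> bit) & 1]


def quotient_stream(bits: Iterable[int], degree: int = 28,
                    low_terms: int = (1 << 25) | 1) -> List[int]:
    """Serially divide the input stream by x^degree + low_terms; return the quotient bits."""
    mask = (1 << degree) - 1
    state = 0
    quotient: List[int] = []
    for bit in bits:
        msb = (state >> (degree - 1)) & 1        # bit shifted out is the quotient bit
        state = ((state << 1) & mask) | (bit & 1)
        if msb:
            state ^= low_terms                   # feed the quotient bit back at the taps
        quotient.append(msb)
    return quotient
```

In this model an all-zero input produces an all-zero quotient, whereas a single erroneous 1 at the start of the stream reappears in the quotient 28 shifts later and again shortly afterwards through the feedback taps, so comparing quotient streams between two chips exposes the error even if some quotient bits are not sampled.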


Note that the LFSRs must be turned on only after DECchip 21164-AA initialization has been 
completed. 


11.7 Observability Scan Registers (OBSs) 


Internal Scan Registers offer observability of debug-critical signals. They are accessed from the
test port under the manufacturing test interface mode as described in Section 11.2.2. The capture
action of an internal scan register occurs TBD CPU cycles after the Test Command Register is
loaded with the appropriate SCAN command. Table 11-6 gives the organization of the DECchip
21164-AA Observability Scan Registers.
Table 11-6: Observability Scan Register Organization

Scan Chain Name: SCache
Size: 164 bits

Scan Chain for Part 1 of SCache:

Bit #   Signal name               Remarks
0       S%IFB_PAR_H<0>            LW Parity for Data<31:0>
1:32    S%IFB_H<0:31>             Data<0:31>
33      S_DIR_CTL%LSEL_WSC_H      LW write enable for Data<31:0>
34:44   S_DCR%ADDR_7A_L<14:4>     Address driven to SCache
45      S_DCR%HIT_H<2>            SET_HIT signal, set 2
46      S_DCR%HIT_H<1>            SET_HIT signal, set 1
47      S_DCR%HIT_H<0>            SET_HIT signal, set 0
48      S_DIR_CTL%RSEL_WSC_H      LW write enable for Data<63:32>
49:80   S%IFB_H<32:63>            Data<32:63>
81      S%IFB_PAR_H<1>            LW Parity for Data<63:32>

Scan Chain for Part 2 of SCache:

Bit #   Signal name               Remarks
0       S%IFB_PAR_H<2>            LW Parity for Data<95:64>
1:32    S%IFB_H<64:95>            Data<64:95>
33      S_DIL_CTL%LSEL_WSC_H      LW write enable for Data<95:64>
34:44   S_DCL%ADDR_7A_L<14:4>     Address driven to SCache
45      S_DCL%HIT_H<2>            SET_HIT signal, set 2
46      S_DCL%HIT_H<1>            SET_HIT signal, set 1
47      S_DCL%HIT_H<0>            SET_HIT signal, set 0
48      S_DIL_CTL%RSEL_WSC_H      LW write enable for Data<127:96>
49:80   S%IFB_H<96:127>           Data<96:127>
81      S%IFB_PAR_H<3>            LW Parity for Data<127:96>


11.8 Controllability Features

TBD.
11.9 Boundary Scan Register

The DECchip 21164-AA Boundary Scan Register is approximately 286 bits long. Table 11-7 gives the
boundary scan register organization. The boundary scan register begins at the TDI_H pin,
traverses the chip in a clockwise direction, and ends at the TDO_H pin.

NOT FINAL

The list below is based on the DECchip 21164-AA die size and pad assignments as of
11/23/92.


Table 11-7: Boundary Scan Register Organization

Signal Name               Type  Count  BSR Cell   Remarks
P%TDI                     B     1      None
P%SROM_OE_L               O     1      out_bcell
P%SROM_CLK_H              O     1      out_bcell
P%SROM_DATA_H             B     1      in_bcell
P%SROM_PRESENT_L          B     1      in_bcell   was SROM_DISABLE
P%PORT_MODE_H<0:1>        I     2      in_bcell
P%SYS_RESET_L             I     1      in_bcell
P%DC_OK_H                 I     1      in_bcell
P%SYS_MCH_CHK             I     1      in_bcell
P%PWR_FAIL_IRQ            I     1      in_bcell
P%MCH_HALT_IRQ            I     1      in_bcell
P%IRQ<3:0>                I     4      in_bcell
P%CLK_IN_H, _L            I     2      in_bcell
P%CPU_CLK_OUT_H           O     1      out_bcell
P%SYS_CLK_OUT_H, _L       O     2      out_bcell
P%SYS_CLK2_OUT_H, _L      O     1      out_bcell
P%ECL_OUT_H               I     1      in_bcell
P%VREF_H                  I     1      in_bcell
P%REF_CLK_IN_H, _L        I     2      in_bcell
P%PERF_MON<0>             I     1      in_bcell
P%ADDR<21:5>              B     17     io_bcell   U-R Corner
P%ADDR<4>                 B     1      io_bcell   U-R Corner
P%DATA<063:0>             B     64     io_bcell
P%DATA_CHECK<0:7>         B     8      io_bcell
P%DATA_VALID<1:0>         O     2      out_bcell
P%EV5_SC_SET<1:0>         O     2      out_bcell  L-R Corner
P%BC_INDEX<25:4>          O     22     out_bcell  L-R Corner
P%EV5_ADDR_RES<1:0>       O     2      out_bcell
P%IDLE_BC                 I     1      in_bcell
P%SYS_LCK_FLG             I     1      in_bcell
P%DATA_BUS_REQ_H          I     1      in_bcell
P%ADDR_BUS_REQ_H          I     1      in_bcell
P%FILL_NOCHK              I     1      in_bcell
P%FILL_ERR                I     1      in_bcell
P%FILL_ID_H               I     1      in_bcell
P%FILL_H                  I     1      in_bcell
P%DACK_H                  I     1      in_bcell
P%CFAIL_H                 I     1      in_bcell
P%CACK_H                  I     1      in_bcell
P%ADDR_CMD_PAR_H          B     1      io_bcell
P%VTM_PENDING             O     1      in_bcell
P%DATA_RAM_WE             O     1      out_bcell
P%DATA_RAM_OE             O     1      out_bcell
P%TAG_RAM_WE              O     1      out_bcell
P%TAG_RAM_OE              O     1      out_bcell
P%EV5_CMD<0:3>            B     4      io_bcell
P%TAG_DAT_PAR             B     1      io_bcell
P%TAG_CTL_PAR             B     1      io_bcell
P%TAG_DIRTY               B     1      io_bcell
P%TAG_SHARED              B     1      io_bcell
P%TAG_VALID               B     1      io_bcell
P%BC_TAG<20:38>           B     19     io_bcell   L-L Corner
P%DATA_VALID<2:3>         O     2      out_bcell
P%DATA_CHECK<15:8>        B     8      out_bcell
P%DATA<064:127>           B     64     io_bcell
P%ADDR_H<39:37>           B     3      io_bcell   U-L Corner
P%ADDR_H<36:22>           B     15     io_bcell   U-L Corner
P%spare                   7     1      io_bcell   Captures zero
P%TEST_STATUS_H<1:0>      O     2      out_bcell
P%TRST_L                  I     1      None
P%TCK                     B     1      None
P%TMS                     B     1      None
P%TDO                     O     1      None
en_for_left_data          sig   1      out_bcell  tbd
en_for_right_data         sig   1      out_bcell  tbd
en_for_bc_tag             sig   1      out_bcell  tbd
en_for_??                 sig   1      out_bcell  tbd


11.10 Testability IPRs 


The following is the list of IPRs connected to testability features. See chapter on PALCode and 
IPRs for more details. 


1. TEST_STATUS_H<0> read and TEST_STATUS_H<1> write (ICSR).

2. Debug port visibility select bits in IPRs (TBD).

3. Serial Terminal Port IPRs (SL_RCV, SL_XMIT).

4. Scache IPRs (SC_CTL, SC_ADDR).

5. Dcache IPRs (DC_MODE, DC_TEST_CTL, DC_TEST_TAG_TEMP).

6. Bcache IPRs (BC_CONTROL, BC_TAG_ADDR).


11.11 Test Feature Reset and Initialization

Reset, initialization and defaults of testability features are described throughout this chapter
and in the chapter on Reset and Initialization. For convenience, this section summarizes the
power-on reset sequence as it pertains to the testability features in normal operation. The
sequence of events is as follows:

1. SYS_RESET_L is asserted.

2. The values on the SROM_PRESENT_L and PORT_MODE_H<1> pins are sampled on
   SYS_RESET_L deassertion.

3. If BiSt is bypassed (indicated by a ’1’ sampled on PORT_MODE_H<1>), go to the next step.

   If BiSt is not bypassed, keep the rest of the chip in the reset state. Perform ICache BiSt (and
   BiSR, if BiSR is required). Clear the ICache tag valid bits at the end of BiSt.

4. If SROMs are not present (indicated by a ’1’ sampled on SROM_PRESENT_L), go to the next
   step.

   If SROMs are present, keep the rest of the chip in the reset state. Load the ICache from the
   SROMs.

5. Deassert internal reset. Fetch the first instruction.
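
The decision flow of the power-on sequence above can be summarized in a short sketch. It models only the ordering of events as described in this section; the function name, parameter names and step strings are illustrative, and the pin values passed in are the levels sampled at SYS_RESET_L deassertion.

```python
from typing import List


def reset_sequence(port_mode_h1: int, srom_present_l: int,
                   bisr_required: bool = False) -> List[str]:
    """Return the ordered list of reset actions for the sampled pin values."""
    steps = [
        "assert SYS_RESET_L",
        "sample SROM_PRESENT_L and PORT_MODE_H<1> on SYS_RESET_L deassertion",
    ]
    if port_mode_h1 == 0:                 # a sampled 1 bypasses BiSt
        steps.append("run ICache BiSt and BiSR" if bisr_required else "run ICache BiSt")
        steps.append("clear ICache tag valid bits")
    if srom_present_l == 0:               # active low: 0 means SROMs are present
        steps.append("load ICache from SROMs")
    steps += ["deassert internal reset", "fetch first instruction"]
    return steps
```

With BiSt bypassed and no SROM present, the sequence reduces to asserting reset, sampling the pins, deasserting internal reset and fetching the first instruction.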
11.12 Open Issues


1. Should we make chip run w/o external oscillator and with internal PLL during EXTEST and
   HIGHZ instructions?

2. Details of bits in OBL and OBS chains to be defined.

3. Details of signals brought to the parallel debug port need to be defined.

4. The following additional test feature enhancements on boundary scan are currently being
   considered:

   • CLAMP_IO Instruction.
   • A ring oscillator mode for the boundary scan.
11.13 Revision History


Table 11-8: Revision History

Who             When       Description of change

Dilip Bhavsar   2/13/92    Working draft
Dilip Bhavsar   6/25/92    Working draft
Dilip Bhavsar   7/1/92     Working draft
Dilip Bhavsar   9/16/92    Rev 0.1 Changes: Second test_status_h pin added. TCR size
                           changed from 4 to 8 bits to program cpu cycle to be captured
                           during scan. Opcodes redefined.
Dilip Bhavsar   11/23/92   Rev 1.0 Clean-up and updates
EV5 interface update

To: @spec_holders
Date: 2-December 1993

From: Peter Bannon
Dept: SEG/HPC
DTN: (8) 225-5249
Loc/Mail: HLO2-3/D11
Net mail: ROCK::BANNON


TITLE: EV5 Interface Update 


1 INTRODUCTION 


The purpose of this note is to update everyone on changes to the 
EV5 interface since Rev 1.9 of the spec. This note was first 
released on February 24, 1993. Change bars mark updates since 
that release. 


2 KNOWN BUGS IN PASS1 


The following bugs have been found in EV5 and will not be fixed 
for PASS1. Systems should be careful to avoid these problems. 


2.1 WRITE BLOCK LOCK 


A WRITE BLOCK LOCK is caused by a store conditional instruction
to I/O space. Two octawords of data will be provided by EV5, each
requiring a DACK. If the system asserts DACK for the first
octaword, and asserts CACK and CFAIL at the same time, and the sysclk
ratio is 3, EV5 will hang.


If DACK, CACK, and CFAIL are asserted for the second cycle the 
write will be failed correctly. 


If CACK and CFAIL are asserted at any time without DACK, the 
write will be failed correctly. 


If the sysclk ratio is something other than 3, any legal 
combination of DACK, CACK, and CFAIL will cause the write to fail 
correctly. 
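
A system designer may want to check a bus-interface model against this erratum. The following Python is a minimal sketch of the hang condition as described above; the function and parameter names are hypothetical, and it assumes the signals are evaluated once per data octaword (0 for the first, 1 for the second).

```python
def pass1_write_block_lock_hangs(sysclk_ratio, octaword, dack, cack, cfail):
    """Return True if this signal combination triggers the PASS1 hang.

    The hang needs all of: sysclk ratio 3, DACK on the first octaword,
    and CACK and CFAIL asserted together with that DACK.
    """
    return (
        sysclk_ratio == 3
        and octaword == 0
        and dack
        and cack
        and cfail
    )
```

Any other legal combination, including failing the write on the second octaword or without DACK, is reported as safe by this predicate.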
2.2 WRITE BLOCK


When doing WRITE BLOCK, EV5 should align the address so 
ADDR_H<5:4> is zero for 64 byte block systems or ADDR_H<4> is 
zero for 32 byte block systems. This works correctly for shared 
writes and writes to I/O space. 


However, if the Bcache is off, and EV5 is using a WRITE BLOCK 
command to write back an Scache victim to memory, EV5 may not 
correctly align ADDR_H. The data will be written as if the 
alignment had been done. To avoid this problem, the system should
ignore ADDR_H<5:4> (ADDR_H<4> for 32 byte systems) when the
address for a WRITE BLOCK is in cacheable space (ADDR_H<39>=0).


If the Bcache is enabled this problem will not occur. 
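
The workaround amounts to masking the low address bits on the system side. A minimal Python sketch follows; the function name and the integer representation of ADDR_H are assumptions made for illustration only.

```python
NONCACHEABLE_BIT = 1 << 39  # ADDR_H<39> = 0 marks cacheable space


def effective_write_block_address(addr, block_size):
    """Return the address a system should use for a WRITE BLOCK.

    For cacheable space, ADDR_H<5:4> (64-byte blocks) or ADDR_H<4>
    (32-byte blocks) is ignored. I/O-space addresses pass through,
    because EV5 aligns those correctly.
    """
    if block_size not in (32, 64):
        raise ValueError("block size must be 32 or 64 bytes")
    if addr & NONCACHEABLE_BIT:
        return addr
    ignored_bits = 0x30 if block_size == 64 else 0x10
    return addr & ~ignored_bits
```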


3 IPR CHANGES 


Please consult the October pre-release of the IPR chapter for
details. A copy can be obtained by contacting John Edmondson,
ROCK::EDMONDSON.


4 UNUSED PINS 


Unused pins may be left unconnected. 


5 PRIVATE READ TIMING 


There have been two changes to private read timing: the addition
of wave pipelining and changes to the timing of output enable.


5.1 Wave Pipelining 


Wave pipelining has been added for 64B block size caches. Wave 
pipelining is not supported for systems that have a 32B block 
size. 


To enable wave pipelining, BC_RD_SPD should be set to the
latency of the Bcache read. BC_CONTROL<18:17> should be set to
the number of cycles to subtract from BC_RD_SPD to get the Bcache
repetition rate. For example, if BC_RD_SPD is set to 7 and
BC_CONTROL<18:17> is set to 2, it will take 7 cycles for valid
data to arrive at the pins, but a new read will start every 5
cycles.

The read repetition rate must be greater than 3. For example, it
is not permitted to set BC_RD_SPD to 5 and BC_CONTROL<18:17> to
2.


BC_RD_SPD=6, BC_CONTROL<18:17>=2:
V - shows where EV5 would clock the Bcache data into the pad ring

[Timing diagram: CPU clock; Index I0 through I3; Data 0 through 3, each marked with a V; OE.]


The value of BC_CONTROL<18:17> should be added to the normal value
of BC_CNFG<14:12> to increase the time between reads and writes.
This will prevent a write from starting before the last data of a
read is received.
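
The wave pipelining settings can be checked with a few lines of arithmetic. This Python sketch uses hypothetical function names; `wave` stands for the value programmed into BC_CONTROL<18:17>, and `normal_spacing` for the value BC_CNFG<14:12> would have without wave pipelining.

```python
def bcache_read_repetition_rate(bc_rd_spd, wave):
    """Return the number of CPU cycles between successive Bcache reads."""
    if not 0 <= wave <= 3:
        raise ValueError("BC_CONTROL<18:17> is a two-bit field")
    rate = bc_rd_spd - wave
    if rate <= 3:
        raise ValueError("read repetition rate must be greater than 3")
    return rate


def adjusted_read_write_spacing(normal_spacing, wave):
    """Return the BC_CNFG<14:12> value with the wave pipelining count added."""
    return normal_spacing + wave
```

With BC_RD_SPD set to 7 and a wave value of 2, the first function returns 5, matching the example in the text; BC_RD_SPD of 5 with a wave value of 2 is rejected.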


5.2 Output Enable 


TAG_RAM_OE_H and DATA_RAM_OE_H will not assert during the first
CPU cycle of a private Bcache read.

[Timing diagram: CPU clock; Index I0 through I3; OE, which is not asserted during the first CPU cycle of the read.]


5.3 Private Write Timing 


To improve the data hold time at the end of the write, the data
will be driven one CPU cycle after the index. It will be
de-asserted one cycle after the index. For private writes to the
Bcache the write pulse can be programmed in each cycle.


Index ...|--- I0 --->|--- I1 --->|--- I2 --->|--- I3 --->|.........
Data  ......|00000000000|11111111111|22222222222|33333333333|......
WE    ...|PPPPPPPPPPP|PPPPPPPPPPP|PPPPPPPPPPP|PPPPPPPPPPP|.........


P - programmable write pulse


Note that the minimum value of BC_CONFIG<18:16>, FILL_WE_OFFSET, is
still 1, which will add one CPU cycle of disabled write pulse at
the start of each fill.


6 CALCULATING IDLE_BC


The rule for calculating IDLE_BC has changed. The new equation
is:


    if (block_size == 32b)
        x = 2
    else
        x = 0

    read_hit_idle  = 2 + x + (block_size/16)*BC_RD_SPD +
                     tri-state_RAM_turn_off - 3*wave_pipelining
    read_miss_idle = 7 + BC_RD_SPD + Sysclk_ratio + tri-state_RAM_turn_off
    write_idle     = 4 + (block_size/16)*BC_WRT_SPD + tri-state_EV5_turn_off


Take the largest of the three times and then round up to the next
Sysclk boundary.


When determining the tri-state_turn_off times, if the System will
not turn on its drivers for some number of nanoseconds after EV5
starts driving the Bcache index, this time can be used to reduce
the tri-state_turn_off time.


For example, if the sysclk ratio is 6, with a 64-byte block, a Bcache
read/write speed of 5, no wave pipelining, 2 cycles for
tri-state read, and 0 cycles for tri-state write, the equations would
work out to:


    read_hit_idle  = 2 + 0 + (64/16)*5 + 2 - 3*0 = 24
    read_miss_idle = 7 + 5 + 6 + 2               = 20
    write_idle     = 4 + (64/16)*5 + 0           = 24


MAX (24/6,20/6,24/6) = 4 
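
The calculation above is easy to get wrong by hand, so a small helper may be useful. The Python below is a sketch under stated assumptions: the function and parameter names are hypothetical, all times are in CPU cycles, and the result is the number of sysclk cycles after rounding up to the next sysclk boundary.

```python
def idle_bc_sysclks(block_size, bc_rd_spd, bc_wrt_spd, sysclk_ratio,
                    ram_turn_off, ev5_turn_off, wave_pipelining=0):
    """Return the IDLE_BC time in sysclk cycles."""
    x = 2 if block_size == 32 else 0
    transfers = block_size // 16  # one Bcache access per 16 bytes

    read_hit_idle = (2 + x + transfers * bc_rd_spd
                     + ram_turn_off - 3 * wave_pipelining)
    read_miss_idle = 7 + bc_rd_spd + sysclk_ratio + ram_turn_off
    write_idle = 4 + transfers * bc_wrt_spd + ev5_turn_off

    worst_case = max(read_hit_idle, read_miss_idle, write_idle)
    # Ceiling division rounds up to the next sysclk boundary.
    return -(-worst_case // sysclk_ratio)
```

For the worked example (ratio 6, 64-byte block, read and write speed 5, 2 cycles of RAM turn-off, 0 cycles of EV5 turn-off) the helper returns 4.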


[Timing diagram: sysclk edges N through N+4; signals IDLE_BC, DACK, Index (I0 through I3), and Data (0 through 3).]


If EV5 receives IDLE_BC asserted at sysclk edge N, the FILL
command can be received at sysclk edge N+3. EV5 will drive the
index to fill the Bcache on sysclk edge N+4.


7 FILLS


During a fill from memory, the Bcache tag store is written for 
each INT16 of data. The system is required to drive the share, 
dirty, and parity bits with the correct value for the entire 
fill. 
8 TRI-STATE

8.1 Private Read Or Write To FILL


The time required to tri-state the EV5 drivers at the end of a
write, or the Bcache drivers at the end of a read, is part of
the IDLE_BC equation.


8.2 Bcache Victim To FILL 


There are two read miss with victim cases to consider, one if the 
READ MISS will be first, and another if the READ MISS will be 
second. 


8.2.1 Read Miss First


The time to turn off the Bcache drivers at the end of a Bcache
victim is fixed by the EV5 design. The system must allow for
this time before starting a fill.


The final DACK will be received by EV5 on the rising edge of
sysclk. If the corresponding rising CPU clock edge is labeled N,
OE will de-assert at the rising edge of CPU clock N+4.


[Timing diagram: CPU clock edges N through N+4; signals CPUCLK, SYSclk, DACK, Index, Data, and OE.]


8.2.2 Read Miss Second


The time to turn off the Bcache drivers at the end of a Bcache
victim is fixed by the EV5 design. The system must allow for
this time before starting a fill.


The final DACK will be received by EV5 on the rising edge of
sysclk. If the corresponding rising CPU clock edge is labeled N,
then the READMISS command will arrive on the next sysclk edge,
and the OE will de-assert at the rising edge of CPU clock N+S+1,
where S is the sysclk ratio. If the sysclk ratio is 3, it will
take an extra sysclk to send the READMISS command, so the OE will
de-assert at N+2S+1.

[Timing diagram: CPU clock edges N, N+S, and N+S+1; signals CPUCLK, a command line showing READ MISS, SYSclk, and DACK.]
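
The two Bcache victim cases can be summarized in one helper. This is a minimal Python sketch with hypothetical names; `n` is the CPU clock edge that corresponds to the sysclk edge on which EV5 receives the final DACK.

```python
def victim_oe_deassert_edge(n, sysclk_ratio, read_miss_first):
    """Return the CPU clock edge at which OE de-asserts after a Bcache victim."""
    if read_miss_first:
        # Read miss first: fixed by the EV5 design at N+4.
        return n + 4
    # Read miss second: N+S+1, or N+2S+1 when the sysclk ratio is 3,
    # because sending the READMISS command then takes an extra sysclk.
    sysclks_to_command = 2 if sysclk_ratio == 3 else 1
    return n + sysclks_to_command * sysclk_ratio + 1
```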


8.3 System Command To FILL 


At the end of a system command that uses the Bcache, the system 
must provide enough time for the Bcache drivers to turn off 
before returning any fill data. 


The final DACK will be received by EV5 on the rising edge of
sysclk. If the corresponding rising CPU clock edge is labeled N,
OE will de-assert at the rising edge of CPU clock N+5.


[Timing diagram: CPU clock edges N through N+5; signals CPUCLK, SYSclk, DACK, Index, Data, and OE.]


A side effect of this is the earliest assertion of FILL after a 
system command. The system must allow time for the OE to go off 
and the RAMs to stop driving the bus before the system drives the 
fill data. 


If the system command was a set shared or an invalidate command,
the system must allow time for EV5 to complete the Bcache tag
write and then for the drivers to turn off before driving the
TAG_SHARED, TAG_DIRTY, and TAG_CTL_PAR wires.


EV5 will begin the tag write one CPU cycle after the response is
sent to the system. The write will take BC_WRT_SPD cycles to
complete. During the write DATA_RAM_OE will be asserted;
TAG_RAM_OE will not. At the end of the write TAG_RAM_OE will
pulse for one CPU cycle and then both will go off. In the
picture below, if the response is driven at the rising edge of
CPU clock N, then the OE will fall at N+2+BC_WRT_SPD, or N+6 for
a 4 cycle write speed.

[Timing diagram: CPU clock edges N and N+2+BC_WRT_SPD; signals CPUCLK, SYSclk, Res (ACK/Bcache), Index (tag write), Data, and Tag OE.]


8.4 FILL To Private Read Or Write 


At the end of the fill, EV5 will not begin to drive the data bus 
until the 5th CPU cycle after the Sysclk that loads the last 
DACK. EV5 will not assert OE until the 5th cycle after the 
Sysclk that loads the last DACK. 


If systems require more time to turn off their drivers, they must
use IDLE_BC and DATA_BUS_REQ at the end of the fill to stop EV5
requests and not send any system requests.


[Timing diagram: CPU clock edges N through N+5; signals CPUCLK, SYSclk, DACK, Index, Data, and OE.]


9 COMMAND ENCODINGS


Two new commands have been added to support write lockout on
TurboLaser systems. Other systems should respond to the
READ MS MOD LKn commands as if they were regular
READ MISS MODn commands. See John Edmondson’s memo on write
timeliness, which is appended to the end of this note.


    EV5_CMD_NOP                   0
    EV5_CMD_LOCK                  1
    EV5_CMD_FETCH                 2
    EV5_CMD_FETCH_M               3
    EV5_CMD_MB                    4
    EV5_CMD_SET_DIRTY             5
    EV5_CMD_WRITE_BLOCK           6
    EV5_CMD_WRITE_BLOCK_LOCK      7
    EV5_CMD_READ_MISS0            8
    EV5_CMD_READ_MISS1            9
    EV5_CMD_READ_MISS_MOD0       10
    EV5_CMD_READ_MISS_MOD1       11
    EV5_CMD_BCACHE_VICTIM        12
    EV5_CMD_RESERVE              13
    EV5_CMD_READ_MS_MOD_STx0     14
    EV5_CMD_READ_MS_MOD_STx1     15


10 EV5 RESPONSES TO SYSTEM REQUESTS


The following table shows the responses that EV5 will generate
for each system request.

               BCACHE               SCACHE                     RESPONSE

FLUSH          None                 Sc_Miss                    Noack
               None                 Sc_Hit, not dirty          Noack
               None                 Sc_Hit, dirty              Ack_Sc
               None                 hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Miss              Sc_Miss                    Noack
               Bc_Miss              Sc_Hit, not dirty          Noack
               Bc_Miss/Hit          Sc_Hit, dirty              Ack_Sc
               Bc_Miss/Hit          hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Hit, not dirty    Sc_Miss/Hit, not dirty     Noack
               Bc_Hit, dirty        Sc_Miss/Hit, not dirty     Ack_Bc

INVALIDATE     None                 Sc_Miss                    Noack
SET SHARED     None                 Sc_Hit                     Ack_Sc
               Bc_Miss/Hit          Sc_Miss/Hit                Ack_Bc

READ           None                 Sc_Miss                    Noack
               None                 Sc_Hit                     Ack_Sc
               None                 hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Miss              Sc_Miss                    Noack
               Bc_Miss/Hit          Sc_Hit                     Ack_Sc
               Bc_Miss/Hit          hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Hit               Sc_Miss                    Ack_Bc

READ DIRTY     None                 Sc_Miss                    Noack
RD DIRTY INV   None                 Sc_Hit, not dirty          Noack
               None                 Sc_Hit, dirty              Ack_Sc
               None                 hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Hit, dirty        Sc_Hit, dirty              Ack_Sc
               Bc_Hit, dirty        hit in Sc_Victim_Buffer    Ack_Sc
               Bc_Hit, dirty        Sc_Miss/Hit, not dirty     Ack_Bc

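
As an illustration of how the table reads, the READ rows reduce to a simple priority: the Scache (or its victim buffer) answers first, then the Bcache, and otherwise EV5 responds with Noack. The Python below is a sketch of the READ rows only; the function name and argument encoding are hypothetical, with `bcache_hit=None` standing for a system without a Bcache.

```python
from typing import Optional


def ev5_read_response(bcache_hit: Optional[bool], scache_hit: bool,
                      victim_buffer_hit: bool = False) -> str:
    """Return the EV5 response to a system READ request."""
    if scache_hit or victim_buffer_hit:
        return "Ack_Sc"
    if bcache_hit:
        return "Ack_Bc"
    return "Noack"
```

The other commands depend on dirty state as well and would need additional arguments.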
10.1 Scache Tag/Data Par_err - Any Command


Should this occur while Cbox is processing an incoming system
command, it will be completed as normal and the usual response
transmitted. However, the parity error will eventually be logged
in the SC_ADDR and SC_STAT Cbox IPRs, and a machine check is
generated by Ibox.


10.2 System Address Command Par_err - Any Command


Same as above, except the parity error is logged in the EI_ADDR
and EI_STAT IPRs, and Cbox will terminate the incoming system
command and respond with NOACK.


11 SENDING SYSTEM REQUEST TO EV5 


The rules for sending a request from the system to EV5 have 
changed some. The new rules follow this model: 


    if (init)
        count = 0;

    if (cmd && (count < 2)) {
        count++;
        /* ADDR_BUS_REQ can already be asserted */
        send(cmd);
    }

    if (ev5_res == ack/scache) {
        if (cmd == read_dirty || cmd == read_dirty_inv ||
            cmd == read || cmd == flush) {
            /* first, receive all the data */
            count--;
        }
        else {
            count--;
        }
    }

    if (ev5_res == ack/bcache ||
        ev5_res == noack) {
        count--;
    }
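
The model above limits the system to two outstanding requests and retires a request only after its response, and any Scache data, has been received. A runnable Python sketch of the same bookkeeping follows; the class and method names are hypothetical, and responses are represented as plain strings.

```python
DATA_COMMANDS = {"read_dirty", "read_dirty_inv", "read", "flush"}


class SystemRequestTracker:
    """Track how many system requests are outstanding at EV5."""

    MAX_OUTSTANDING = 2

    def __init__(self):
        self.count = 0  # cleared at init

    def try_send(self):
        """Account for one new command; return False if two are outstanding."""
        if self.count >= self.MAX_OUTSTANDING:
            return False
        self.count += 1
        return True

    @staticmethod
    def must_wait_for_data(ev5_res, cmd):
        """True if all data must be received before the request is retired."""
        return ev5_res == "ack/scache" and cmd in DATA_COMMANDS

    def retire(self):
        """Call once per response, after any required data has arrived."""
        if self.count == 0:
            raise RuntimeError("no outstanding request to retire")
        self.count -= 1
```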


12 SEQUENCING CPU AND SYSTEM REQUEST THROUGH THE BCACHE 


This section goes over the rules for determining the order in
which the BIU processes EV5 and System requests. In general the
order of processing is determined by the system using EV5 CMD,
IDLE_BC, and FILL.


1. If IDLE_BC is not asserted and there are no valid requests in
   the system command buffer, then EV5 is free to do any CPU
   request.

2. If a FILL is pending, EV5 will only produce another read
   miss, with a possible Bcache victim. It will not attempt any
   other command.

3. The assertion of IDLE_BC, or the sending of a valid non-NOP
   system command to EV5, will cause the BIU to idle. If the BIU
   has a command loaded in the pad ring, it will remove the
   command and replace it with a NOP. The state of the
   command/address bus is unpredictable until the idle condition
   goes away.

4. The idle condition ends when EV5 receives a de-asserted
   IDLE_BC, and EV5 has responded to all the system commands
   that were sent.

5. The System must not assert CACK during the idle condition.

6. There is one exception to rules 3, 4, and 5. If IDLE_BC or a
   system command arrives while EV5 is reading the Bcache, and
   that read turns into a miss, and it does not produce a
   victim, then EV5 will load the miss into the pad ring. The
   system may CACK this read miss request at any time.

7. If CACK is asserted at the same time as IDLE_BC or a valid
   system request, CACK wins and the command is taken by the
   system. CACK should not be asserted if IDLE_BC has been
   asserted or a valid System command is underway.

8. A read miss with a victim is treated as a pair. The order,
   read miss then victim or victim then read miss, is
   programmable. Either way, if the first command is CACKed,
   then both commands must be CACKed and all the data DACKed,
   before EV5 will respond to any other request.

9. The CACK for a WRITE BLOCK or BCACHE VICTIM must be received
   by EV5 with or before the last DACK of the data. For
   WRITE BLOCK and BCACHE VICTIM, it is possible to DACK all but
   the last data, and then decide to do something else.

10. The CACK for a READ MISS must arrive with or before the last
    DACK for the requested fill.
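
Rules 3 through 6 can be condensed into two predicates. The following Python is a simplified sketch with hypothetical names: it does not model the same-cycle race of rule 7 or the pairing of rule 8, and it treats the rule 6 exception as a flag supplied by the caller.

```python
def biu_is_idle(idle_bc, unanswered_system_cmds):
    """Rules 3 and 4: the BIU idles while IDLE_BC is asserted or while
    EV5 still owes a response to a system command."""
    return bool(idle_bc) or unanswered_system_cmds > 0


def system_may_cack(idle_bc, unanswered_system_cmds,
                    clean_read_miss_in_pad_ring=False):
    """Rule 5, with the rule 6 exception for a read miss without a victim
    that EV5 loaded into the pad ring."""
    if clean_read_miss_in_pad_ring:
        return True
    return not biu_is_idle(idle_bc, unanswered_system_cmds)
```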


12.1 Read Miss With Victim Example 


In this example EV5 asserts a read miss with a victim. The
system DACKs two data items from the Bcache and then asserts IDLE_BC.
This causes EV5 to remove the pending read miss with victim. EV5
will reassert the read miss and victim, if needed, at a later
time.

[Timing diagram: signals include IDLE_BC, CACK, DACK, and Index (I0 through I2).]


12.2 IDLE_BC And CACK Race Example


In this example, IDLE_BC and CACK are asserted in the same
Sysclk. This means that the System will take the read miss and
victim before doing anything else. Note that the last DACK for
the victim is in the same cycle as the CACK for the victim
command, meeting the requirement that the CACK arrive before or
with the last DACK.


[Timing diagram: signals include CMD, ADDR, Ad Req, IDLE_BC, CACK, DACK, Index (I0 through I3), Data (0 through 3), and OE.]


12.3 Read Miss With IDLE_BC Asserted Example


In this example, EV5 has started a Bcache read that misses.
IDLE_BC is asserted, but no victim was created, so the read miss
request is loaded into the pad ring. The system then takes the
request.


12.4 Read Miss With Victim Abort Example


In this example, EV5 produces a read miss with a victim and is
waiting for the system to take it when the system takes the bus
and requests a read dirty. EV5 drives the miss request for one
more cycle after it gets the bus back and then removes the
request. EV5 then responds to the read dirty and drives the index
to read the Bcache. Not shown in the picture is EV5 restarting
the Bcache read, requesting the read miss with victim. If the
victim block was invalidated by the system request, EV5 will
produce a clean read miss.


[Timing diagram: signals include CMD, ADDR, Ad Req, EV5 RES, CACK, DACK, Index, Data, and OE.]


12.5 Bcache Hit Under Read Miss Example


In this example, EV5 produces a read miss and requests a fill
from the system. A Bcache hit to index j takes place while we
wait for the fill. The system then returns the requested data in
two bursts, asserting CACK at the same time as the last DACK.
SUBJECT: Proposed Solution to Write Timeliness Problem


EV5 is prepared to make a pin-bus command encoding change, 
provided the Turbo-Laser team confirms they want it. The change 
is intended to solve the problems that arise from the difference 
between the EV5 cache-coherence protocol and the Turbo-Laser bus 
cache-coherence protocol. The problems are write timeliness and 
EV5 hanging due to write operation livelock. I describe in 
detail the proposed EV5 pin-bus changes and the proposed 
external logic which utilizes this change to solve the stated 
problems. 


Description of the Protocol Bug 


This is a description of the protocol bug which results from the 
difference between the EV5 cache-coherence protocol and the 
Turbo-Laser bus cache-coherence protocol. The bug occurs when 
one processor (CPU A) repeatedly writes to block X and another 
processor (CPU B) has to write block X too. CPU B’s write 
misses in its caches and causes a READ-MISS-MOD. That 
READ-MISS-MOD hits dirty in CPU A’s cache and causes the block 
to become SHARED in CPU A’s cache. After CPU B’s cache receives 
the fill of block X, CPU B tries to send a WRITE-BLOCK to its 
bus interface. Meanwhile, a subsequent write by CPU A is 
broadcast on the bus (CPU A wins arb since CPU B just used the 
bus and takes a certain amount of time to respond to the fill 
and produce the WRITE-BLOCK command). Since CPU A’s write won 
arb on the bus, CPU B’s bus interface has to force CPU B off its
command bus (via ADDR_BUS_REQ_H) and send CPU B an invalidate of
block X. This means that CPU B will have to abort the pending
WRITE-BLOCK to block X and start over by sending a READ-MISS-MOD
to block X. Meanwhile, CPU A converted block X to PRIVATE-CLEAN
and responds to CPU B’s READ-MISS-MOD by making block X shared
in its cache. This begins the process over again. Notice that
CPU A completes new writes again and again while CPU B is never
able to complete even one write.


The EV5 Change


The proposed change in the pin-bus command encodings is: 


The encoding for WRITE BLOCK and WRITE BLOCK LOCK changes from 14 and
15, respectively, to 6 and 7, respectively. (This undoes an
earlier change which I think was proposed by the Turbo-Laser
team. The reason should be clear soon.)


Two new commands are added: READ-MISS-STC0 and READ-MISS-STC1.
The encodings for these are 14 and 15 (decimal). These commands 
are used exclusively for fills due to store conditional 
execution. The following table shows the four relevant commands 
and their binary encodings. 


CMD<3:0>   Command           Optional   Comments

1010       READ MISS MOD0    No         Request for data, modify in
1011       READ MISS MOD1    No         Request for data, modify in
1110       READ MISS STC0    No         Request for data, STx_C spe
1111       READ MISS STC1    No         Request for data, STx_C spe


The choice of encodings allows the Cbox to logic-or CMD<2> with 
the status indicating STx_C, making this change feasible. 
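
The effect of OR-ing CMD<2> can be verified directly from the binary encodings. In this Python sketch the constant and function names are hypothetical; only the encodings themselves come from the table.

```python
READ_MISS_MOD0 = 0b1010  # READ MISS MOD1 is 0b1011


def read_miss_mod_encoding(fill_number, is_stx_c):
    """Return CMD<3:0> for a read miss with modify intent.

    Setting CMD<2> when the miss is due to a store conditional turns
    READ MISS MODn (1010, 1011) into READ MISS STCn (1110, 1111).
    """
    cmd = READ_MISS_MOD0 | (fill_number & 1)
    if is_stx_c:
        cmd |= 1 << 2
    return cmd
```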


EV5 will use READ MISSn for read misses without modify intent, 
READ MISS MODn for read misses to complete an ordinary write, 
and READ MISS STCn for read misses to complete a store 
conditional. EV5’s behavior is otherwise unchanged from before
(i.e., READ MISS STCn should lead to a FILLn and so on).


The Proposed External Logic 


To solve both the write timeliness problem and the problem of 
EV5 hanging, I propose Turbo-Laser implement the following 
mechanism. It uses LOCKOUT assertion on the Turbo-Laser bus to 
guarantee a write completes. This proposed mechanism detects 
when a write is not likely to complete without LOCKOUT assertion 
and asserts LOCKOUT. Then it correctly detects completion of 
the write and deasserts LOCKOUT. There is no non-error case in 
which LOCKOUT is asserted for a write that has been aborted by 
EV5, so a short timeout on LOCKOUT is not necessary. I’ll
discuss errors later.


For completeness, I assume LOCKOUT is a wired-or signal on the
Turbo-Laser bus. When it is asserted, each CPU module that is
not asserting it must not attempt to arb for the bus. As a
result the one or more CPUs asserting LOCKOUT get preferential
treatment on the bus until they stop asserting lockout. If the
CPUs use this to complete pre-existing write operations and not
for writes which begin after LOCKOUT assertion, then LOCKOUT
deassertion is guaranteed in a bounded amount of time.


Many of the ideas included in this proposal were given to me by
Dennis Foley.


Note that this mechanism is required to solve both the write 
timeliness problem and the write livelock problem. A simpler 
solution exists in which timeliness is not solved but EV5 will 
never hang due to livelock. I’ll discuss this later.


The external interface implements two address CAMs with 
associated logic (CAMO and CAM1). The CAMs load from and 
compare to address bits 12:6 of the address sent by EV5 with 
memory access commands. Each CAM has an associated valid bit, a 
fill number bit, a fill pending bit, and an associated three-bit 
counter. The CAMs are loaded and validated as a side effect of 
some READ-MOD-MISSn commands, and are invalidated as a side 
effect of other READ-MISS-MOD commands or WRITE-BLOCK commands. 
Here are the details: 


1. At reset, invalidate the CAMs and clear their counters. 


2. If both CAMs are invalid, load CAMO with each 
READ-MISS-MODO and CAM1 with each READ-MISS-MOD1, but only 
if those commands are CACKed. Set the fill pending bit and 
record the fill number. 


3. If exactly one CAM is valid, load the other on every 
READ-MISS-MODn that doesn’t hit in the valid CAM, recording 
the £111 number and setting the fill pending bit as above. 


4. If both CAMs are valid, every READ-MISS-MODn should hit 
one or the other. Otherwise it is an error. 


5. Validate a CAM on the corresponding fill (fill pending 
set and fill number matches), if and only if the fill is 
SHARED. Clear the fill pending bit on every corresponding 
fill. 


6. If a READ-MISS-MODn is issued which hits in a valid CAM 
(valid and matches in address bits <12:6>), increment the 
associated counter (except don’t increment the counter if 
LOCKOUT is asserted on the bus). Record the new fill number 
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and set the fill pending bit. This all occurs only if the 
command is CACKed. 


7. If a WRITE-BLOCK is issued that hits in a valid CAM, 
invalidate the CAM and clear the counter, but only if the 
WRITE-BLOCK is CACKed. 


8. If a fill corresponding to a valid CAM occurs (fill 
pending and fill number matches) and the fill is NOT-SHARED, 
invalidate the CAM. (Assumes NOT-SHARED => DIRTY in this 
case.) 


9. If a valid CAM’s counter reaches 7, assert LOCKOUT. From 
here on, the counter shouldn’t increment, so LOCKOUT should 
stay asserted until the CAM is invalidated (meaning the 
corresponding write operation finished). 


10. If a new write operation begins while the CPU is 
asserting LOCKOUT, the counter must not be incremented. 
Otherwise there is no guarantee this CPU ever stops asserting 
LOCKOUT. 
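

As an illustration only, rules 1 through 10 can be modelled as a 
small state machine. The Python sketch below is not part of the 
proposal. All names in it (Cam, LockoutLogic, cam_index, 
ProtocolError, the bus_lockout argument) are hypothetical. It 
also makes two assumptions the rules do not state: a CAM's 
counter is cleared when rule 8 invalidates it, and a rule 4 
violation is reported as an exception. 

```python
from dataclasses import dataclass

CAM_MASK = 0x7F   # seven bits: address<12:6>
COUNTER_MAX = 7   # value at which LOCKOUT is asserted (rule 9)


def cam_index(address: int) -> int:
    """Return address bits <12:6>, the field each CAM stores and compares."""
    return (address >> 6) & CAM_MASK


class ProtocolError(Exception):
    """Rule 4: a READ-MISS-MODn missed both CAMs while both were valid."""


@dataclass
class Cam:
    tag: int = 0
    valid: bool = False
    fill_pending: bool = False
    fill_number: int = 0
    counter: int = 0

    def load(self, tag: int, fill_number: int) -> None:
        # Record the address field and fill number, set fill pending.
        self.tag = tag
        self.fill_number = fill_number
        self.fill_pending = True

    def invalidate(self) -> None:
        self.valid = False
        self.counter = 0  # stated for rule 7; ASSUMED for rule 8


class LockoutLogic:
    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Rule 1: both CAMs invalid, counters cleared.
        self.cams = [Cam(), Cam()]

    @property
    def lockout(self) -> bool:
        # Rule 9: asserted while any valid CAM's counter sits at 7.
        return any(c.valid and c.counter >= COUNTER_MAX for c in self.cams)

    def read_miss_mod(self, n, address, cacked=True, bus_lockout=False):
        if not cacked:
            return  # rules 2 and 6 apply only to CACKed commands
        tag = cam_index(address)
        valid = [c for c in self.cams if c.valid]
        hit = next((c for c in valid if c.tag == tag), None)
        if hit is not None:
            # Rules 6 and 10: no increment while LOCKOUT is asserted,
            # whether by another CPU (bus_lockout) or by this one.
            if not (bus_lockout or self.lockout):
                hit.counter += 1
            hit.load(tag, n)
        elif len(valid) == 0:
            self.cams[n].load(tag, n)  # rule 2: CAMn takes READ-MISS-MODn
        elif len(valid) == 1:
            other = next(c for c in self.cams if not c.valid)
            other.load(tag, n)         # rule 3
        else:
            raise ProtocolError("READ-MISS-MOD missed both valid CAMs")

    def fill(self, fill_number, shared):
        for c in self.cams:
            if c.fill_pending and c.fill_number == fill_number:
                c.fill_pending = False  # rule 5: cleared on every such fill
                if shared:
                    c.valid = True      # rule 5: validate only if SHARED
                elif c.valid:
                    c.invalidate()      # rule 8: NOT-SHARED fill

    def write_block(self, address, cacked=True):
        if not cacked:
            return
        tag = cam_index(address)
        for c in self.cams:
            if c.valid and c.tag == tag:
                c.invalidate()          # rule 7
```

In this model, a block that keeps filling SHARED raises the 
counter once per CACKed READ-MISS-MODn until it reaches 7, at 
which point the lockout property reads true and stays true until 
a NOT-SHARED fill or a CACKed WRITE-BLOCK invalidates the CAM. 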


Necessary assumptions 


This proposal is based on certain assumptions about EV5. These 
are: 


1. EV5 may only process two writes that involve the Bcache 
or external environment. In particular, if EV5 is working 
on two writes which require READ-MISS-MODn, READ-MISS-STCn, 
SET-DIRTYs, or WRITE-BLOCK to complete, EV5 will never begin 
processing a third write that requires any of these things 
until one of the current writes is completed. Completion 
means the write hit dirty, not-shared in the Scache or did a 
WRITE-BLOCK that was CACKed. 


2. EV5 never processes more than one write at a time with a 
particular value of address bits <12:6>. I.e., if one write 
is being processed and another is begun which matches in 
address<12:6>, the second will be retried internally and 
will never lead to READ-MISS-MODn, READ-MISS-STCn, 
SET-DIRTYs, or WRITE-BLOCK. 


3. After a fill for READ-MISS-MODn which fills not-shared, 
dirty, EV5 will guarantee to complete the corresponding 
write (given also that fills are always in the same order as 
the commands were issued by EV5 on the pin-bus). This 
guarantee specifically covers the case in which a fill is 
closely followed by an INVALIDATE to the same block. 


4. EV5 will complete a STx_C after a fill for READ-MISS-STCn 
regardless of the fill state (shared or not-shared) if the 
lock flag is not set. This guarantee specifically covers 
the case in which a fill is closely followed by an 
INVALIDATE to the same block. (If the fill is shared, there 
is no guarantee, but only an invalidate can prevent 
completion and this will reset the lock flag.) 
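

Assumptions 1 and 2 are what make two CAMs sufficient: at most 
two writes can be outstanding to the external environment, and 
no two of them share address<12:6>. The sketch below models just 
those two admission conditions. It is a hypothetical 
illustration (WriteTracker, try_start, and complete are invented 
names) and says nothing about how EV5 tracks writes internally. 

```python
MAX_EXTERNAL_WRITES = 2   # assumption 1: at most two such writes at once


def cam_index(address: int) -> int:
    """Return address bits <12:6>."""
    return (address >> 6) & 0x7F


class WriteTracker:
    """Hypothetical model of the write-admission limits assumed of EV5."""

    def __init__(self) -> None:
        # address<12:6> values of writes currently being processed
        self.in_flight = set()

    def try_start(self, address: int) -> bool:
        tag = cam_index(address)
        if tag in self.in_flight:
            return False  # assumption 2: retried internally, no bus command
        if len(self.in_flight) >= MAX_EXTERNAL_WRITES:
            return False  # assumption 1: a third write must wait
        self.in_flight.add(tag)
        return True

    def complete(self, address: int) -> None:
        # The write hit dirty, not-shared, or its WRITE-BLOCK was CACKed.
        self.in_flight.discard(cam_index(address))
```

Under these limits every in-flight write has a distinct 
address<12:6> value and there are never more than two, so each 
can be tracked by its own CAM. 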


An assumption about Turbo-Laser/TLEP required for this proposal 
is that fills to EV5 for READ-MISS-MODn will either be shared, 
not-dirty or not-shared, dirty. Not-shared, not-dirty fills 
require rethinking detail item 8 to include a SET-DIRTY that 
completes. 


One rule that EV5 requires is that a READ-DIRTY, INVALIDATE, or 
SET-SHARED for the same block just filled to EV5 must not follow 
too closely after the fill for a READ-MISS-MODn that filled 
dirty, not-shared. This is needed to guarantee that EV5 will 
complete one write. We are currently evaluating a rule for the 
number of CPU cycles after the last Sysclock DACK cycle before 
any of these system commands for the same block are allowed to 
be sent. 


Errors 


The only error I can anticipate is that a fill error which 
caused the environment to use CFAIL*~CACK to reset EV5 will 
cause EV5 to "drop" all writes. This event should invalidate the 
CAMs and clear their counters. 


Other errors can be detected. Perhaps machine check interrupt 
is the best way to handle these. 


A long timeout on LOCKOUT may be useful, but a short timeout may 
prevent LOCKOUT from accomplishing its purpose. 


An Alternative 


An alternative implementation with exactly one CAM will (I 
think) eventually guarantee EV5 finishes all pending writes 
provided EV5 won’t start new writes past a certain point (i.e., 
the WMB effect). This scheme doesn’t solve timeliness in 
unusual cases (cases with two antagonist CPUs writing two 
different blocks continuously). PALcode could insert a WMB in 
interrupt flows, so every timer interrupt or other interrupt 
could force "stuck" writes to complete. A waiver from the ALPHA 
Architecture Board would be needed. A simpler, cheaper scheme 
covering all reasonable circumstances coupled with eventual 
guarantees via timer interrupt seems quite reasonable to me. 


